Hello,
I have 2 or 4 H200 GPUs on the same server (I have tried both configurations). Here is my setup:
vLLM API server version 0.9.0.1, running in a vllm-openai Docker container.
gcc 11.4.0
cmake version 4.0.2
Ubuntu 24.04.2 LTS
Driver Version: 575.51.03
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
Here are my latest vLLM parameters (I have tried many different variations, but the problem is always the same):
TRANSFORMERS_OFFLINE=1
HF_DATASETS_OFFLINE=1
VLLM_ENGINE_ITERATION_TIMEOUT_S=360
--task generate --tensor-parallel-size 4 --max-num-seqs 256 --max-num-batched-tokens 25536 --gpu-memory-utilization 0.90 --enable-chunked-prefill --enable-prefix-caching --distributed-executor-backend ray --max-model-len 70000 --dtype bfloat16 --swap-space 141
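For completeness, here is roughly how these env vars and flags are combined at launch (a sketch; the model path, port, and volume mounts are placeholders for my actual values):

```shell
# Sketch of the equivalent launch command; model path and port are
# placeholders, the flags mirror the list above.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e TRANSFORMERS_OFFLINE=1 \
  -e HF_DATASETS_OFFLINE=1 \
  -e VLLM_ENGINE_ITERATION_TIMEOUT_S=360 \
  vllm/vllm-openai:v0.9.0.1 \
  --model /models/my-70b-model \
  --task generate --tensor-parallel-size 4 \
  --max-num-seqs 256 --max-num-batched-tokens 25536 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill --enable-prefix-caching \
  --distributed-executor-backend ray \
  --max-model-len 70000 --dtype bfloat16 --swap-space 141
```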
vLLM has been running fine for small-scale testing for about a month, but now that I am increasing the load, I am seeing aborted requests (INFO [async_llm.py:420] Aborted request chatcmpl-XXX) during my custom real-world benchmarking.
Before that, I ran your vLLM benchmark script; here are my results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 54.31
Total input tokens: 215196
Total generated tokens: 188229
Request throughput (req/s): 18.41
Output token throughput (tok/s): 3466.05
Total Token throughput (tok/s): 7428.68
--------------Time to First Token---------------
Mean TTFT (ms): 4846.07
Median TTFT (ms): 3767.26
P99 TTFT (ms): 12372.16
----Time per Output Token (excl. 1st token)-----
Mean TPOT (ms): 274.80
Median TPOT (ms): 155.41
P99 TPOT (ms): 1012.16
--------------Inter-token Latency---------------
After that, I wrote a script that sends requests to my LLM in parallel to benchmark a real-world scenario: it sends X requests of Z tokens every Y seconds (even if the previous requests have not been answered yet). Here is what the server logs show:
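The script is essentially this pattern (a minimal sketch; `send_fn` stands in for my actual OpenAI-compatible chat-completion call with Z-token prompts):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(send_fn, x, y, waves):
    """Fire `x` requests every `y` seconds for `waves` rounds,
    without waiting for earlier requests to finish."""
    pool = ThreadPoolExecutor(max_workers=x * waves)
    futures = []
    for _ in range(waves):
        futures += [pool.submit(send_fn) for _ in range(x)]
        time.sleep(y)  # next wave starts regardless of outstanding requests
    # collect results (or re-raise exceptions) once all waves are submitted
    results = [f.result() for f in futures]
    pool.shutdown()
    return results
```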
INFO 07-01 07:10:20 [loggers.py:116] Engine 000: Avg prompt throughput: 9529.8 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 21 reqs, Waiting: 39 reqs, GPU KV cache usage: 58.1%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:30 [loggers.py:116] Engine 000: Avg prompt throughput: 12692.3 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 25 reqs, Waiting: 35 reqs, GPU KV cache usage: 69.3%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:40 [loggers.py:116] Engine 000: Avg prompt throughput: 12713.0 tokens/s, Avg generation throughput: 10.6 tokens/s, Running: 29 reqs, Waiting: 31 reqs, GPU KV cache usage: 78.2%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:42 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:42 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:42 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:42 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:50 [loggers.py:116] Engine 000: Avg prompt throughput: 6342.3 tokens/s, Avg generation throughput: 14.3 tokens/s, Running: 31 reqs, Waiting: 27 reqs, GPU KV cache usage: 84.5%, Prefix cache hit rate: 1.1%
INFO 07-01 07:10:51 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:51 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:51 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:51 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:00 [loggers.py:116] Engine 000: Avg prompt throughput: 6371.3 tokens/s, Avg generation throughput: 15.3 tokens/s, Running: 33 reqs, Waiting: 17 reqs, GPU KV cache usage: 90.1%, Prefix cache hit rate: 1.1%
INFO 07-01 07:11:01 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:01 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:01 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:01 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:10 [loggers.py:116] Engine 000: Avg prompt throughput: 6334.7 tokens/s, Avg generation throughput: 13.1 tokens/s, Running: 35 reqs, Waiting: 7 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 1.0%
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
Is this normal?
My guess is that the KV cache fills up and the scheduler rejects (aborts) the requests. Is there a way to avoid that, even if it means accepting longer response times?
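From reading the logs, my working theory is that the aborts fire when my client side gives up (timeout or disconnect) while requests sit in the Waiting queue. One mitigation I am considering is capping in-flight requests client-side, so excess requests queue on my side instead of the server's (a sketch; `send` stands in for my actual async chat call):

```python
import asyncio

async def bounded_gather(send, prompts, max_in_flight=32):
    # Allow at most `max_in_flight` concurrent requests; the rest wait
    # client-side instead of piling into the server's Waiting queue.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with sem:
            return await send(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))
```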
Would putting the vLLM Production Stack on top help, for example to get KV cache offloading?
Does anyone have advice on a config/setup for serving a 70B model on multiple H200s?
Thank you for your help.