Aborted request during benchmarking (H200-Llama 3.3 70B)

Hello,

I have 2 to 4 H200 GPUs on the same server. Here is my setup:

vLLM API server version 0.9.0.1 running in a vllm-openai Docker container.
gcc 11.4.0
cmake version 4.0.2

Ubuntu 24.04.2 LTS
Driver Version: 575.51.03      

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Here are my latest vLLM parameters (I tried many different variations and the problem is always the same):

TRANSFORMERS_OFFLINE=1
HF_DATASET_OFFLINE=1
VLLM_ENGINE_ITERATION_TIMEOUT_S=360

--task generate --tensor-parallel-size 4 --max-num-seqs 256 --max-num-batched-tokens 25536 --gpu-memory-utilization 0.90 --enable-chunked-prefill --enable-prefix-caching --distributed-executor-backend ray --max-model-len 70000 --dtype bfloat16 --swap-space 141
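For reference, here is roughly how I launch it (the image tag, port mapping, and model path below are illustrative, not my exact values):

```shell
# Sketch of the launch command, assuming the official vllm/vllm-openai image
# and a local Hugging Face cache; the model path is illustrative.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e TRANSFORMERS_OFFLINE=1 \
  -e HF_DATASET_OFFLINE=1 \
  -e VLLM_ENGINE_ITERATION_TIMEOUT_S=360 \
  vllm/vllm-openai:v0.9.0.1 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --task generate --tensor-parallel-size 4 \
  --max-num-seqs 256 --max-num-batched-tokens 25536 \
  --gpu-memory-utilization 0.90 --enable-chunked-prefill \
  --enable-prefix-caching --distributed-executor-backend ray \
  --max-model-len 70000 --dtype bfloat16 --swap-space 141
```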

vLLM has been running fine for small-scale testing for a month or so, but now that I try to increase the load, I have been experiencing aborted requests (INFO [async_llm.py:420] Aborted request chatcmpl-XXX) during my custom real-world benchmarking.

Before that, I ran the official vLLM benchmark script; here is my result:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  54.31
Total input tokens:                      215196
Total generated tokens:                  188229
Request throughput (req/s):              18.41
Output token throughput (tok/s):         3466.05
Total Token throughput (tok/s):          7428.68
--------------Time to First Token---------------
Mean TTFT (ms):                          4846.07
Median TTFT (ms):                        3767.26
P99 TTFT (ms):                           12372.16
----Time per Output Token (excl. 1st token)-----
Mean TPOT (ms):                          274.80
Median TPOT (ms):                        155.41
P99 TPOT (ms):                           1012.16
--------------Inter-token Latency---------------

After that, I wrote a script that sends requests to my LLM in parallel to benchmark a real-world scenario: it fires X requests every Y seconds (even if previous requests have not been answered yet), each asking for Z tokens.
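A minimal sketch of that load pattern (the endpoint URL, model name, and prompt below are placeholders, not my actual values):

```python
# Sketch of the load generator: fire X requests every Y seconds, without
# waiting for earlier ones to finish. Endpoint and model are placeholders.
import asyncio
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
MODEL = "meta-llama/Llama-3.3-70B-Instruct"             # assumed model name

def build_payload(prompt: str, max_tokens: int) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def post_request(payload: dict) -> int:
    """Send one blocking request; return the HTTP status code."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
        return resp.status

async def run(x_requests: int, y_interval_s: float, rounds: int):
    """Every Y seconds, launch X requests in worker threads, never waiting
    for the previous batch to complete; gather everything at the end."""
    tasks = []
    for _ in range(rounds):
        payload = build_payload("Summarize this document ...", max_tokens=512)
        for _ in range(x_requests):
            tasks.append(asyncio.create_task(
                asyncio.to_thread(post_request, payload)))
        await asyncio.sleep(y_interval_s)
    await asyncio.gather(*tasks, return_exceptions=True)

# Example invocation (requires a running server):
# asyncio.run(run(x_requests=10, y_interval_s=5.0, rounds=12))
```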

INFO 07-01 07:10:20 [loggers.py:116] Engine 000: Avg prompt throughput: 9529.8 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 21 reqs, Waiting: 39 reqs, GPU KV cache usage: 58.1%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:30 [loggers.py:116] Engine 000: Avg prompt throughput: 12692.3 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 25 reqs, Waiting: 35 reqs, GPU KV cache usage: 69.3%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:40 [loggers.py:116] Engine 000: Avg prompt throughput: 12713.0 tokens/s, Avg generation throughput: 10.6 tokens/s, Running: 29 reqs, Waiting: 31 reqs, GPU KV cache usage: 78.2%, Prefix cache hit rate: 1.2%
INFO 07-01 07:10:42 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:42 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:42 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:42 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:50 [loggers.py:116] Engine 000: Avg prompt throughput: 6342.3 tokens/s, Avg generation throughput: 14.3 tokens/s, Running: 31 reqs, Waiting: 27 reqs, GPU KV cache usage: 84.5%, Prefix cache hit rate: 1.1%
INFO 07-01 07:10:51 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:51 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:51 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:51 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:10:52 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:10:52 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:00 [loggers.py:116] Engine 000: Avg prompt throughput: 6371.3 tokens/s, Avg generation throughput: 15.3 tokens/s, Running: 33 reqs, Waiting: 17 reqs, GPU KV cache usage: 90.1%, Prefix cache hit rate: 1.1%
INFO 07-01 07:11:01 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:01 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:01 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:01 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:02 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:02 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:10 [loggers.py:116] Engine 000: Avg prompt throughput: 6334.7 tokens/s, Avg generation throughput: 13.1 tokens/s, Running: 35 reqs, Waiting: 7 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 1.0%
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.
INFO 07-01 07:11:12 [async_llm.py:420] Aborted request chatcmpl-.
INFO 07-01 07:11:12 [async_llm.py:327] Request chatcmpl-aborted.

Is this normal?
I guess the KV cache gets full and the scheduler rejects (aborts) the requests. Is there a way to avoid that, even if it means accepting longer response times?
Would putting the vLLM Production Stack on top help, to get KV cache offloading?

Does anyone have advice on config/setup for a 70B model on multiple H200s?

Thank you for your help.

Aborted requests in vLLM under high load are typically a symptom of insufficient GPU KV cache space, especially with large models and long context lengths. When the KV cache is nearly full, vLLM preempts or queues requests to avoid out-of-memory errors; clients that give up waiting (timeouts, disconnects) then abort their requests, which is what the Aborted request lines in your logs show. This is expected behavior when the system cannot accommodate more concurrent or long-running requests given the current configuration and hardware limits.

You can reduce the frequency of aborted requests by:
- lowering max_model_len, max_num_seqs, or max_num_batched_tokens;
- increasing gpu_memory_utilization (but not too close to 1.0, as this risks OOM);
- adding GPUs and increasing tensor_parallel_size to provide more KV cache memory.

However, there is no built-in mechanism to queue requests indefinitely once the cache is full: requests may still be aborted if resources are exhausted, even if you are willing to accept higher latency.
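To see why the cache fills up, a back-of-the-envelope KV cache calculation helps. Using the public Llama 3.3 70B architecture values (80 layers, 8 KV heads via GQA, head dim 128, bf16); the memory-headroom figures below are rough estimates, not measured numbers:

```python
# Rough KV cache sizing for Llama 3.3 70B in bf16. Architecture values are
# from the public model config; the memory budget numbers are estimates.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2  # bfloat16

# K and V caches per token, summed across all layers.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 327680 bytes, i.e. 320 KiB per token

# Assumed budget: 4 x H200 (141 GB each) at 0.90 utilization,
# minus ~140 GB of bf16 weights (70B params x 2 bytes).
total_gb = 4 * 141 * 0.90
weights_gb = 140
kv_budget_gb = total_gb - weights_gb

max_cached_tokens = int(kv_budget_gb * 1e9 / kv_bytes_per_token)
print(f"~{max_cached_tokens:,} tokens of KV cache")
```

Under these assumptions that is roughly a million tokens of KV cache, so at --max-model-len 70000 only on the order of 16 worst-case full-length sequences fit at once, which is consistent with the logs showing the cache near 95% with ~30 running requests.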

KV cache offloading (e.g., to CPU or disk) is not natively supported in vLLM for production use as of v0.9.x. Experimental integrations like LMCache exist, but they require additional setup and may impact performance. For 70B models on multiple H200s, the usual best practices are: use tensor parallelism to shard the model, keep gpu_memory_utilization below 0.98, and tune max_model_len and batch sizes to fit within available memory. If you need to support more concurrent or longer requests, consider reducing context length or batch size, or explore distributed setups with more GPUs. For more details or step-by-step tuning, let me know.
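As a starting point for a more conservative configuration, something along these lines (the exact numbers are guesses to be tuned against your workload, not recommended values):

```shell
# More conservative settings: a shorter context and smaller batches leave
# more KV cache headroom per request. All values are illustrative.
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-seqs 128 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.92 \
--enable-chunked-prefill \
--enable-prefix-caching \
--dtype bfloat16
```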
