vLLM 0.10.1 benchmark does not free memory

Engine 000: Avg prompt throughput: 100.0 tokens/s, Avg generation throughput: 121.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.2%, Prefix cache hit rate: 94.9%
[1m39s < 10m23s] Progress: 137/1000 (0 failed) (13.7%), Decoding: 1, Prefilling: 38, TTFT: 0.00 (20684.01), ITL: 22.06 (23.27), Decoding throughput: 45.32 (181.19), Prefilling throughput: 0.00 (49.64)
Traceback (most recent call last):
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 539, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/streams.py", line 680, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ysgg1/vllm_split/scripts/benchmark/backend_request_func.py", line 188, in async_request_openai
    async with session.post(url=api_url, json=payload, headers=headers) as response:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 1517, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 786, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 764, in _connect_and_send_request
    await resp.start(conn)
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 534, in start
    with self._timer:
         ^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
 http: proxy error: context canceled

With the V0 engine, GPU KV cache utilization never gets this high and all requests complete normally. With V1, once GPU KV cache usage reaches around 99%, requests get stuck and the client reports timeouts. After a longer period, the decode server reports that memory is insufficient to handle the 1000 requests.

Based on your description, with vLLM V1 requests hang and time out once GPU KV cache usage approaches 99%, and the decode server eventually reports insufficient memory, while V0 does not show this problem. The root cause is that the KV cache runs out of space, so requests are repeatedly preempted or left pending until they hit a timeout or an OOM error. V1 uses the RECOMPUTE preemption strategy by default, and frequent preemption badly hurts end-to-end performance and stability, especially under high concurrency or large batches. You can lower max_num_seqs or max_num_batched_tokens, or reduce gpu_memory_utilization somewhat, to shrink the number of requests per batch and relieve the KV cache pressure. You should also monitor the preemption count and GPU memory usage (a monitoring sketch follows below) and tune your concurrency and batch settings so that KV cache usage never approaches 100%. See the official optimization guide for details.
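For that monitoring, a rough sketch (assuming the default server port 8000 and the standard Prometheus metric names, which can vary slightly between vLLM versions):

# Watch preemptions and KV cache pressure every 5 seconds (adjust the port to your instance).
# vllm:num_preemptions_total climbing while vllm:gpu_cache_usage_perc sits near 1.0 means
# requests are being preempted because the KV cache is exhausted.
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "num_preemptions_total|gpu_cache_usage_perc|num_requests_(running|waiting)"'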

If you need high throughput you can raise the batch size, but make sure the KV cache has enough headroom, otherwise you will see exactly the stuck requests and OOM errors you describe. V1's preemption mechanism improves robustness, but under extreme load it introduces significant latency and failure rates. Test and adjust the parameters above iteratively against your actual GPU memory and model size to find the right balance between throughput and stability. See here for detailed tuning methods.

Would you like more detailed parameter tuning advice or log analysis?

@RunLLM The parameters are not working.

http: proxy error: context canceled

Logically, the KV cache utilization shouldn't be this high.

If your KV cache utilization is unexpectedly high and parameters like gpu_memory_utilization or max_num_seqs are not working as expected, this may be due to vLLM’s memory profiling and allocation strategy: vLLM empirically measures memory usage at startup and allocates KV cache based on available memory after reserving space for model weights and activations. If you set gpu_memory_utilization too high (close to 1), there may not be enough headroom for temporary tensors, CUDA graphs, or fragmentation, causing OOM or stuck requests—even if your settings seem reasonable. This is a known limitation, and actual memory usage can exceed your expectations due to these factors. See here for details and here for parameter explanations.

If you want more precise control, try lowering gpu_memory_utilization (e.g., 0.8), reduce max_model_len, or explicitly set kv_cache_memory_bytes. Also, note that vLLM does not have a strict formula for memory allocation—it relies on runtime profiling, so results may vary across hardware and models. If the problem persists, check for memory fragmentation or CUDA graph overhead, and consider running with --enforce-eager to disable CUDA graphs for debugging. Would you like a step-by-step troubleshooting checklist?
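For example, a minimal debugging launch along those lines might look like this (a sketch only; <your-model> and the concrete values are placeholders to adapt for your GPU and model):

# Leave more headroom for activations, CUDA graphs, and fragmentation, cap the context
# length, and disable CUDA graph capture while debugging memory behaviour.
vllm serve <your-model> --gpu-memory-utilization 0.8 --max-model-len 8192 --enforce-eager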

@RunLLM V0 can complete the benchmark, but V1 cannot. The script is the same, only the startup command is different.

LOG_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=5  vllm serve  /local/QwQ-32B/ --port=8011 --disable-log-requests --enable-chunked-prefill  --tensor-parallel-size=1 --max-model-len=16328 > /tp1.log 2>&1 &
LOG_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=6  vllm serve  /local/QwQ-32B/ --port=8022 --disable-log-requests --enable-chunked-prefill  --tensor-parallel-size=1 --max-model-len=16328 > /3instance_tp1.log 2>&1 &
LOG_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=7  vllm serve  /local/QwQ-32B/ --port=8033 --disable-log-requests --enable-chunked-prefill  --tensor-parallel-size=1 --max-model-len=16328 > /qwq_32b_3instance_tp1.log 2>&1 &

This is a known issue: in vLLM V1, when running benchmarks with multiple instances (e.g., 3x single-GPU servers), requests can get stuck or time out, especially when GPU utilization or KV cache usage is high. The same script works in V0 but fails in V1 due to differences in memory management and scheduling. V1’s chunked prefill and preemption mechanisms can cause requests to hang if KV cache is exhausted, and high GPU usage (99%+) is a common symptom. See this issue for details.
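To confirm whether preemption is what stalls your run, you can grep the server logs from the commands above (the exact warning wording differs between engine versions, so match loosely):

# Look for preemption / KV-cache warnings emitted by the three instances.
grep -iE "preempt|kv cache" /tp1.log /3instance_tp1.log /qwq_32b_3instance_tp1.log | tail -n 20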

If you need to run the benchmark successfully, use V0 or reduce the number of concurrent requests, batch size, or max_model_len in V1. Would you like a step-by-step guide to further debug or tune V1 for your setup?

@RunLLM I need to successfully run the benchmark on v1. How can I achieve this?

To successfully run the benchmark on vLLM V1, reduce KV cache pressure by lowering max_num_seqs and/or max_num_batched_tokens, and set gpu_memory_utilization to a safer value (e.g., 0.8–0.9). This prevents preemption and OOM errors that cause hangs in V1. Also, monitor preemption warnings and adjust parameters if you see frequent preemptions. See Optimization and Tuning for detailed guidance.
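Applied to one of your launch commands, that could look like the following (a sketch; the numbers are starting points to tune, not values verified for QwQ-32B on your hardware):

# Same instance as before, with a smaller batch and a slightly reduced memory target.
LOG_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=5 vllm serve /local/QwQ-32B/ --port=8011 \
    --disable-log-requests --enable-chunked-prefill --tensor-parallel-size=1 \
    --max-model-len=16328 \
    --max-num-seqs=64 \
    --max-num-batched-tokens=4096 \
    --gpu-memory-utilization=0.90 > /tp1.log 2>&1 &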

If you still encounter issues, try reducing max_model_len, disabling CUDA graphs with --enforce-eager, or running fewer concurrent requests. These steps help ensure V1 can complete the benchmark without stalling. Would you like a step-by-step example or troubleshooting checklist?

@RunLLM I used these parameters, but they didn’t work. The same problem persists.

If lowering max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, and even max_model_len or --enforce-eager do not resolve the issue in vLLM V1, this may be due to a known bug in V1’s memory management and KV cache calculation, especially for 32B+ models on single 24GB/32GB GPUs. Many users report that V1 fails to initialize or hangs under load, while V0 works with the same parameters. This is confirmed in recent GitHub issues and is not fully resolved in the latest V1 releases.

Currently, the only reliable workaround is to use V0 for your benchmark, or try much smaller models. The V1 engine’s memory management for large models is under active investigation. Would you like to see community-reported workarounds or get notified when a fix is available?
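If you do fall back to V0 for now, on 0.10.x builds that still ship the V0 engine it can usually be selected with the VLLM_USE_V1 environment variable (a sketch based on one of your launch commands; V0 is deprecated, so whether this works depends on your exact build):

# Force the legacy V0 engine for this instance, keeping everything else unchanged.
LOG_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=5 VLLM_USE_V1=0 vllm serve /local/QwQ-32B/ --port=8011 \
    --disable-log-requests --enable-chunked-prefill --tensor-parallel-size=1 \
    --max-model-len=16328 > /tp1.log 2>&1 &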

@RunLLM server log

👾 prepare: Insufficient kv cache for 1000 tokens

Your log line "prepare: Insufficient kv cache for 1000 tokens" means the KV cache does not have enough free space left to allocate blocks for 1000 tokens. This is usually caused by a large model, a max_model_len that is set too high, or simply not enough GPU memory. Even after adjusting the parameters, vLLM V1 can still run into this with large models under high concurrency; it is a known limitation of the V1 engine. See the related discussion here.
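As a rough sanity check of why the cache fills up so quickly (back-of-the-envelope only, assuming QwQ-32B follows the Qwen2.5-32B layout of 64 layers, 8 KV heads, and head_dim 128 with a BF16 KV cache; verify against your model's config.json):

# Approximate KV cache footprint per token and per max-length sequence.
PER_TOKEN_BYTES=$((2 * 64 * 8 * 128 * 2))   # 2 (K and V) * 64 layers * 8 KV heads * 128 head_dim * 2 bytes (BF16)
echo "per token:            $((PER_TOKEN_BYTES / 1024)) KiB"                  # 256 KiB
echo "one 16328-token seq:  $((PER_TOKEN_BYTES * 16328 / 1024 / 1024)) MiB"   # about 4 GiB

With BF16 weights for a 32B model taking about 65 GB, often only a few to a dozen GiB remain for KV cache on a single GPU (depending on its size), so a handful of long requests is enough to drive cache usage to 99% and trigger this message.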

Try lowering max_model_len, max_num_seqs, and max_num_batched_tokens further, or switch to a smaller model. If that still does not resolve it, use V0 for now and wait for the V1 KV cache management bug to be fixed. Would you like specific parameter suggestions or log analysis?

@RunLLM I have already tried adjusting the parameters you mentioned above, but it didn’t work.

If you have already tried lowering max_model_len, max_num_seqs, max_num_batched_tokens, and gpu_memory_utilization but still get “Insufficient kv cache” errors in vLLM V1, this is a known unresolved issue with V1’s memory management for large models (especially 7B+ or 32B+ on 12–24GB GPUs). Many users report that V1 cannot run these models even with aggressive parameter tuning, while V0 works as expected. See this issue for confirmation and details.

At this time, there is no reliable fix for this in V1. The only stable workaround is to use V0 for your benchmarks or switch to a smaller model until the V1 engine’s KV cache bug is resolved. Would you like to track updates or see alternative workarounds?
