vLLM 0.10.1 v1 benchmark stuck

The vLLM version is 0.10.1. The v0 benchmark runs successfully, but with the v1 engine the requests get stuck and the stuck clients eventually time out. The benchmark log is provided below.

Traceback (most recent call last):
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 539, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/streams.py", line 680, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ysgg1/vllm_split/scripts/benchmark/backend_request_func.py", line 188, in async_request_openai
    async with session.post(url=api_url, json=payload, headers=headers) as response:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 1517, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 786, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 764, in _connect_and_send_request
    await resp.start(conn)
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 534, in start
    with self._timer:
         ^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError

[5m54s < 40m35s] Progress: 127/1000 (1 failed) (12.7%), Decoding: 0, Prefilling: 38, TTFT: 0.00 (15779.53), ITL: 21.37 (21.37), Decoding throughput: 0.00 (273.82), Prefilling throughput: 0.00 (67.14)
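
For reference, the TimeoutError above is raised by the aiohttp client inside the benchmark script, not by the server. Below is a minimal sketch of how the client's request timeout could be raised (or effectively disabled) to confirm the requests are truly stuck rather than merely slow; the session setup is an assumption, since only async_request_openai appears in the traceback and the real backend_request_func.py may construct its session differently.

import aiohttp

# Assumed session setup for the benchmark client (not the actual
# backend_request_func.py code). A very large total timeout makes a stuck
# server show up as a hang instead of an aiohttp TimeoutError.
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

async def post_completion(api_url: str, payload: dict, headers: dict) -> str:
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        async with session.post(url=api_url, json=payload, headers=headers) as response:
            return await response.text()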

The scheduler log is below.

finish serving request: deaec2cd-7a00-4d3c-9c6e-c41024ec6b85
connection of request: 289076dc-e996-4cbd-ad6e-8c5ea911e404, scheduler request: e8420178-1460-45d3-9ea0-3c85e27a2f75 closed without finish
finish prefill stage of request[abort] or some wrong with input parameter : e8420178-1460-45d3-9ea0-3c85e27a2f75
finish serving request: e8420178-1460-45d3-9ea0-3c85e27a2f75
http: proxy error: context canceled

I noticed that requests get stuck once the service's GPU KV cache usage reaches about 99%.

[loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 92.1%
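
To correlate the hang with KV cache saturation from outside the engine log, the server's Prometheus metrics can be polled while the benchmark runs. This is a minimal sketch; the /metrics path and the vllm:gpu_cache_usage_perc / vllm:num_requests_* metric names are assumptions and may differ between vLLM versions and the v0/v1 engines.

import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed serve address
WATCHED = (
    "vllm:gpu_cache_usage_perc",
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
)

def poll_metrics(interval_s: float = 5.0) -> None:
    # Print the watched gauges every few seconds so KV cache saturation can
    # be lined up with the moment requests start to get stuck.
    while True:
        body = urllib.request.urlopen(METRICS_URL).read().decode()
        for line in body.splitlines():
            if line.startswith(WATCHED):
                print(line)
        print("---")
        time.sleep(interval_s)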

When using vLLM v1, if GPU KV cache usage approaches 99%, new requests can get stuck or time out because the engine cannot allocate enough KV cache blocks for them. This is a known issue: when the KV cache is full, the scheduler may not preempt or recompute requests as expected, so the system hangs and client requests time out. Reducing max_num_seqs or max_num_batched_tokens can sometimes mitigate the problem, but it is not a complete solution. The v0 engine handles cache exhaustion more gracefully; v1 may hang or silently ignore requests when the cache is exhausted or the configuration is invalid, especially if max-num-batched-tokens is smaller than max-model-len or the batch size is too large for the available memory. See the discussions in "vLLM server hangs and timeouts after initial requests" and "System gets stuck during pressure test".
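
As a quick sanity check on the configuration point above, the intended engine arguments can be validated before launch. The values below are purely illustrative (not taken from this deployment); the check mirrors the constraint that, without chunked prefill, a full-length prompt must fit into a single batch.

# Illustrative values only; substitute the settings actually passed to the engine.
max_model_len = 32768
max_num_batched_tokens = 8192
enable_chunked_prefill = True

# Without chunked prefill, one prefill must cover an entire prompt, so
# max_num_batched_tokens < max_model_len leaves long requests unschedulable.
if not enable_chunked_prefill and max_num_batched_tokens < max_model_len:
    raise ValueError(
        "max_num_batched_tokens must be >= max_model_len when chunked prefill is disabled"
    )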

To avoid this, try lowering max_num_seqs or max_num_batched_tokens, or increasing gpu_memory_utilization (though not too close to 1.0, to avoid OOM). Also monitor for preemption warnings and consider tuning the parallelism parameters. If the problem persists, you may need to restart the server or revert to the v0 engine for stability. For more details on tuning and preemption, refer to Optimization and Tuning.
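
As a concrete starting point, here is a hedged sketch of those knobs through the offline Python API; the same parameters map to the vllm serve flags --max-num-seqs, --max-num-batched-tokens, --gpu-memory-utilization, and --max-model-len. The model name and the specific values are assumptions for illustration, not settings tuned for this workload.

# To compare against the v0 engine (if this build still supports it), set the
# environment variable VLLM_USE_V1=0 before importing vllm.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed model; replace with yours
    gpu_memory_utilization=0.95,       # more headroom for KV cache, but keep < 1.0
    max_model_len=16384,               # cap context to shrink per-request KV footprint
    max_num_seqs=64,                   # fewer concurrent sequences under memory pressure
    max_num_batched_tokens=16384,      # keep >= max_model_len if chunked prefill is off
)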

Would you like a step-by-step breakdown of how to adjust these parameters or further troubleshooting steps?
