V1 does not free GPU memory

On V1, requests were sent to the GPU, but its memory usage did not decrease; instead, it kept increasing rapidly.

[loggers.py:123] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 77.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.7%, Prefix cache hit rate: 99.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 1199.9 tokens/s, Avg generation throughput: 204.3 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.6%, Prefix cache hit rate: 99.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 1598.0 tokens/s, Avg generation throughput: 190.2 tokens/s, Running: 12 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.8%, Prefix cache hit rate: 90.4%
[loggers.py:123] Engine 000: Avg prompt throughput: 1498.5 tokens/s, Avg generation throughput: 361.3 tokens/s, Running: 13 reqs, Waiting: 0 reqs, GPU KV cache usage: 50.8%, Prefix cache hit rate: 89.1%
[loggers.py:123] Engine 000: Avg prompt throughput: 1699.8 tokens/s, Avg generation throughput: 404.1 tokens/s, Running: 9 reqs, Waiting: 0 reqs, GPU KV cache usage: 70.0%, Prefix cache hit rate: 90.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 2297.1 tokens/s, Avg generation throughput: 422.2 tokens/s, Running: 10 reqs, Waiting: 0 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 92.5%
[loggers.py:123] Engine 000: Avg prompt throughput: 800.0 tokens/s, Avg generation throughput: 255.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 93.1%

vLLM V1 allocates most of the available GPU memory for the KV cache, and GPU KV cache usage increases as more requests are processed or as context length grows. This is expected behavior: the cache fills up to accommodate active requests and does not shrink until those requests complete and their cache blocks are freed. If the cache approaches 100%, new requests may be queued or preempted, but memory is not released until the workload drops or the engine is restarted. See the docs on metrics and the discussion on GPU KV cache usage.
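
If the goal is simply to leave more GPU memory free for other processes, the fraction that the engine pre-allocates can be lowered via gpu_memory_utilization. A minimal sketch (the model name is a placeholder; this shrinks the KV cache pool rather than changing the freeing behavior described above):

# Sketch: limiting how much GPU memory vLLM pre-allocates for the KV cache.
# gpu_memory_utilization is the fraction of total GPU memory the engine may use;
# lowering it leaves headroom but also reduces how many tokens the cache can hold.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    gpu_memory_utilization=0.80,       # default is 0.9
    max_model_len=8192,                # optionally cap context to shrink per-request KV usage
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

The same knobs exist on the server command line, e.g. vllm serve <model> --gpu-memory-utilization 0.8 --max-model-len 8192.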

If you observe memory usage never decreasing even after all requests finish, or if the engine hangs at high cache usage, this may indicate a bug or a stuck request; this has been reported in some cases and may require a service restart to recover (example issue). One way to confirm this is to watch the metrics endpoint after the workload drains, as sketched below. Would you like more detail on how to tune memory usage or debug a stuck cache?
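
To distinguish "cache is full because requests are still running" from "cache is stuck", the Prometheus metrics endpoint can be polled once the benchmark has finished. A minimal sketch, assuming an OpenAI-compatible server on localhost:8000; the exact metric names (e.g. vllm:gpu_cache_usage_perc, vllm:num_requests_running) can differ between vLLM versions:

# Sketch: poll the vLLM metrics endpoint and print cache/queue gauges.
# If usage stays pinned near 100% while zero requests are running, the
# engine is likely stuck rather than simply busy.
import time
import urllib.request

PREFIXES = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_running")

def scrape_metrics(url="http://localhost:8000/metrics"):
    text = urllib.request.urlopen(url, timeout=5).read().decode()
    return [line for line in text.splitlines() if line.startswith(PREFIXES)]

for _ in range(5):  # sample a few times after the workload drains
    for line in scrape_metrics():
        print(line)
    time.sleep(10)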

I sent 1000 requests. After over 100 of them had completed, the run would get stuck and then report a "client timeout" error:

Traceback (most recent call last):
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 539, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/streams.py", line 680, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ysgg1/vllm_split/scripts/benchmark/backend_request_func.py", line 188, in async_request_openai
    async with session.post(url=api_url, json=payload, headers=headers) as response:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 1517, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 786, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 764, in _connect_and_send_request
    await resp.start(conn)
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 534, in start
    with self._timer:
         ^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
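
For reference, the TimeoutError above is raised by the aiohttp client in the benchmark script, not by vLLM itself. If the server is merely slow under the 1000-request load rather than hung, one client-side mitigation is to raise or disable the per-request timeout. A minimal sketch, assuming the script builds its own ClientSession (the names below are illustrative, not the script's actual variables):

# Sketch: use a generous aiohttp client timeout so long-running benchmark
# requests are not cancelled while they sit in the server's queue.
import asyncio
import aiohttp

# total=None disables the overall timeout; a large ceiling such as
# aiohttp.ClientTimeout(total=6 * 60 * 60) also works.
BENCH_TIMEOUT = aiohttp.ClientTimeout(total=None)

async def send_one(api_url: str, payload: dict, headers: dict) -> str:
    async with aiohttp.ClientSession(timeout=BENCH_TIMEOUT) as session:
        async with session.post(url=api_url, json=payload, headers=headers) as response:
            response.raise_for_status()
            return await response.text()

If generation throughput has genuinely dropped to zero while requests are still marked as running, raising the timeout only postpones the failure, and the stuck-request note above applies.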