V1 does not free GPU memory

On V1, requests were sent to the GPU, but its memory usage did not decrease; instead, it kept increasing rapidly.

[loggers.py:123] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 77.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.7%, Prefix cache hit rate: 99.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 1199.9 tokens/s, Avg generation throughput: 204.3 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.6%, Prefix cache hit rate: 99.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 1598.0 tokens/s, Avg generation throughput: 190.2 tokens/s, Running: 12 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.8%, Prefix cache hit rate: 90.4%
[loggers.py:123] Engine 000: Avg prompt throughput: 1498.5 tokens/s, Avg generation throughput: 361.3 tokens/s, Running: 13 reqs, Waiting: 0 reqs, GPU KV cache usage: 50.8%, Prefix cache hit rate: 89.1%
[loggers.py:123] Engine 000: Avg prompt throughput: 1699.8 tokens/s, Avg generation throughput: 404.1 tokens/s, Running: 9 reqs, Waiting: 0 reqs, GPU KV cache usage: 70.0%, Prefix cache hit rate: 90.2%
[loggers.py:123] Engine 000: Avg prompt throughput: 2297.1 tokens/s, Avg generation throughput: 422.2 tokens/s, Running: 10 reqs, Waiting: 0 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 92.5%
[loggers.py:123] Engine 000: Avg prompt throughput: 800.0 tokens/s, Avg generation throughput: 255.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 93.1%

vLLM V1 allocates most of the available GPU memory for the KV cache, and GPU KV cache usage increases as more requests are processed or as context length grows. This is expected behavior: the cache fills up to accommodate active requests and does not shrink until those requests complete and their cache blocks are freed. If the cache approaches 100%, new requests may be queued or preempted, but memory is not released until the workload drops or the engine is restarted. See the docs on metrics and the discussion on GPU KV cache usage.
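
If the goal is simply to leave more GPU memory free for other processes, the fraction that the engine pre-allocates can be lowered via gpu_memory_utilization. A minimal sketch (the model name is a placeholder; this shrinks the KV cache pool rather than changing the freeing behavior described above):

# Sketch: limiting how much GPU memory vLLM pre-allocates for the KV cache.
# gpu_memory_utilization is the fraction of total GPU memory the engine may use;
# lowering it leaves headroom but also reduces how many tokens the cache can hold.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    gpu_memory_utilization=0.80,       # default is 0.9
    max_model_len=8192,                # optionally cap context to shrink per-request KV usage
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

The same knobs exist on the server command line, e.g. vllm serve <model> --gpu-memory-utilization 0.8 --max-model-len 8192.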

If you observe memory usage never decreasing even after all requests finish, or if the engine hangs at high cache usage, this may indicate a bug or a stuck request; this has been reported in some cases and may require a service restart to recover (example issue). One way to confirm this is to watch the metrics endpoint after the workload drains, as sketched below. Would you like more detail on how to tune memory usage or debug a stuck cache?
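
To distinguish "cache is full because requests are still running" from "cache is stuck", the Prometheus metrics endpoint can be polled once the benchmark has finished. A minimal sketch, assuming an OpenAI-compatible server on localhost:8000; the exact metric names (e.g. vllm:gpu_cache_usage_perc, vllm:num_requests_running) can differ between vLLM versions:

# Sketch: poll the vLLM metrics endpoint and print cache/queue gauges.
# If usage stays pinned near 100% while zero requests are running, the
# engine is likely stuck rather than simply busy.
import time
import urllib.request

PREFIXES = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_running")

def scrape_metrics(url="http://localhost:8000/metrics"):
    text = urllib.request.urlopen(url, timeout=5).read().decode()
    return [line for line in text.splitlines() if line.startswith(PREFIXES)]

for _ in range(5):  # sample a few times after the workload drains
    for line in scrape_metrics():
        print(line)
    time.sleep(10)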

I sent 1000 requests. After over 100 of them had completed, the run would get stuck and then report a "client timeout" error:

Traceback (most recent call last):
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 539, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/streams.py", line 680, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ysgg1/vllm_split/scripts/benchmark/backend_request_func.py", line 188, in async_request_openai
    async with session.post(url=api_url, json=payload, headers=headers) as response:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 1517, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 786, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 764, in _connect_and_send_request
    await resp.start(conn)
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 534, in start
    with self._timer:
         ^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
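
For reference, the TimeoutError above is raised by the aiohttp client in the benchmark script, not by vLLM itself. If the server is merely slow under the 1000-request load rather than hung, one client-side mitigation is to raise or disable the per-request timeout. A minimal sketch, assuming the script builds its own ClientSession (the names below are illustrative, not the script's actual variables):

# Sketch: use a generous aiohttp client timeout so long-running benchmark
# requests are not cancelled while they sit in the server's queue.
import asyncio
import aiohttp

# total=None disables the overall timeout; a large ceiling such as
# aiohttp.ClientTimeout(total=6 * 60 * 60) also works.
BENCH_TIMEOUT = aiohttp.ClientTimeout(total=None)

async def send_one(api_url: str, payload: dict, headers: dict) -> str:
    async with aiohttp.ClientSession(timeout=BENCH_TIMEOUT) as session:
        async with session.post(url=api_url, json=payload, headers=headers) as response:
            response.raise_for_status()
            return await response.text()

If generation throughput has genuinely dropped to zero while requests are still marked as running, raising the timeout only postpones the failure, and the stuck-request note above applies.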