V1 can only handle a very limited number of requests

During benchmarking, V1 filled up the GPU memory after just over 100 requests. Why can't that memory be reused? It doesn't seem to be a parameter issue; the code itself needs to be changed. With the same parameters, V0 runs normally and completes the task.

In vLLM V1, GPU memory may not be fully reclaimed or reused after being filled by a large number of requests, so subsequent requests cannot proceed, even though V0 with the same parameters works fine. This is a known issue, and it is not caused by any user-exposed parameter: it stems from differences in memory management and scheduling between V1 and V0. V1's new architecture changes how memory is allocated and released, and several users have reported that memory is not reclaimed as expected after heavy load, so resolving the problem requires code-level changes or fixes in vLLM itself. Since V0 does not exhibit this behavior under the same conditions, it points to a regression or architectural difference in V1's memory handling logic, not a misconfiguration on your side (vLLM V1 Feedback Thread, Memory Leak Issue in Load Testing Scenario).
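Until the V1 memory handling is fixed, one hedged workaround is to force the legacy V0 engine. This is a sketch, assuming the installed vLLM version still ships both engines and honors the `VLLM_USE_V1` environment variable:

```shell
# Workaround sketch, not a fix: fall back to the V0 engine, which
# does not show this behavior under the same parameters.
# Assumes this vLLM build still includes V0 and reads VLLM_USE_V1.
export VLLM_USE_V1=0

# Then launch the server as usual, e.g.:
# vllm serve <model>    # <model> is a placeholder for your model name
echo "engine flag: VLLM_USE_V1=$VLLM_USE_V1"
```

Setting the variable in the same shell that launches the server (or inline, `VLLM_USE_V1=0 vllm serve ...`) is enough; no code change is needed for this fallback.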

Would you like a step-by-step breakdown of the technical reasons for this difference and possible workarounds?
