V1 can only handle a very limited number of requests

During benchmarking, V1 filled up the GPU memory after just over 100 requests. Why can't that memory be reused? It doesn't seem to be a parameter issue; the code itself needs to be changed. With the same parameters, V0 runs normally and completes the task.

In vLLM V1, GPU memory may not be fully reclaimed or reused after being filled by a large number of requests, so subsequent requests cannot proceed, even though V0 with the same parameters works fine. This is a known issue, and it is not caused by any user-exposed parameter: it stems from differences in memory management and scheduling between V1 and V0. V1's new architecture changes how memory is allocated and released, and several users have reported that memory is not reclaimed as expected after heavy load, so resolving the problem requires code-level changes or fixes in vLLM itself. Since V0 does not exhibit this behavior under the same conditions, it points to a regression or architectural difference in V1's memory handling logic, not a misconfiguration on your side (vLLM V1 Feedback Thread, Memory Leak Issue in Load Testing Scenario).
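Until the V1 memory handling is fixed, one hedged workaround is to force the legacy V0 engine. This is a sketch, assuming the installed vLLM version still ships both engines and honors the `VLLM_USE_V1` environment variable:

```shell
# Workaround sketch, not a fix: fall back to the V0 engine, which
# does not show this behavior under the same parameters.
# Assumes this vLLM build still includes V0 and reads VLLM_USE_V1.
export VLLM_USE_V1=0

# Then launch the server as usual, e.g.:
# vllm serve <model>    # <model> is a placeholder for your model name
echo "engine flag: VLLM_USE_V1=$VLLM_USE_V1"
```

Setting the variable in the same shell that launches the server (or inline, `VLLM_USE_V1=0 vllm serve ...`) is enough; no code change is needed for this fallback.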

Would you like a step-by-step breakdown of the technical reasons for this difference and possible workarounds?
