I observed a significant performance improvement between v0.11.0 and v0.12.0 and would like to figure out what exactly caused it. Any pointers/hints/suggestions would be greatly appreciated!
Context: we maintain a fork of v0.11.0 and cannot easily migrate to v0.12.0. It would be great if we could somehow transplant the root cause of this optimization onto v0.11.0.
I was doing some benchmarking on Qwen3-4B with an L40 GPU. The server launch command looks like this:
In the long-prefix regime, we can easily observe a 3x+ improvement in throughput. This feels quite unusual, especially considering the two releases are only one version number apart.
I then did some Nsight profiling. The difference is also clearly visible there:
The large performance jump between vLLM v0.11.0 and v0.12.0 for workloads with massive shared prefixes is primarily due to optimizations in prefix cache memory management and garbage collection. Specifically, PR #24964 introduced a change to reuse empty block lists in the KVCacheBlocks structure, significantly reducing Python garbage collection (GC) overhead. This optimization alone led to a 3–4% throughput improvement in decode-heavy workloads and even more in prefill-heavy scenarios, as confirmed by GC profiling and benchmarks. The root cause of the improvement is the reduction in GC time spent managing empty lists, which were a major contributor to performance bottlenecks in prefix caching scenarios with long shared prefixes and high concurrency.
If you want to “transplant” this optimization to your v0.11.0 fork, focus on backporting the changes from PR #24964 and related discussions in issue #24321. These changes make KVCacheBlocks immutable and pre-allocate empty block objects for reuse, thus minimizing GC pressure. No major scheduler changes were cited as the main driver for this specific improvement in the long-prefix regime. Would you like a step-by-step summary of the code changes to port?
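To make the pattern concrete, here is a minimal sketch of the idea behind PR #24964, not the actual vLLM code: instead of allocating a fresh object (and a fresh empty list) on every "no blocks" path, hand out one preallocated, immutable empty instance. The class and method names below are illustrative stand-ins, not upstream identifiers.

```python
# Sketch of the GC-pressure-reduction pattern from PR #24964 (illustrative,
# not the actual vLLM code): reuse a single shared empty instance rather
# than constructing a new empty container on every hot-path call.
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheBlocks:
    """Simplified stand-in for vLLM's KVCacheBlocks; immutable by design."""
    blocks: tuple = ()

    # Plain class attribute (not a dataclass field); filled in below.
    _EMPTY = None

    @classmethod
    def empty(cls) -> "KVCacheBlocks":
        # Returning the shared instance means the hot path allocates
        # nothing, so there is no new garbage for the GC to track.
        return cls._EMPTY


KVCacheBlocks._EMPTY = KVCacheBlocks()

a = KVCacheBlocks.empty()
b = KVCacheBlocks.empty()
print(a is b)  # the empty result is shared, not reallocated
```

Because the class is frozen, sharing the empty instance is safe: no caller can mutate it out from under another. That is why the PR pairs the reuse with making the structure immutable.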
@RunLLM Is there any way I can profile the GC overhead in v0.11.0? I see that v0.12.0 has a VLLM_GC_DEBUG environment variable, but v0.11.0 doesn't.
It seems likely that v0.11.0 does not have the VLLM_GC_DEBUG environment variable or the built-in GC debug callback found in v0.12.0 and later. The GC profiling feature (VLLM_GC_DEBUG) was introduced after v0.11.0, as described in PR #24829 and the gc_utils documentation. Therefore, you cannot directly enable GC profiling in v0.11.0 using this variable.
If you need GC profiling in v0.11.0, you would need to manually backport the relevant code from v0.12.0 or later, specifically the logic in vllm/utils/gc_utils.py and the callback registration. Would you like a step-by-step guide on how to do this?
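If a full backport is overkill, you can approximate what VLLM_GC_DEBUG measures with a few lines of stdlib-only code, since CPython exposes `gc.callbacks` hooks that fire around every collection. This is a minimal sketch of the same idea as `vllm/utils/gc_utils.py`, not the upstream implementation; the class and attribute names are illustrative.

```python
# Minimal stand-in for the VLLM_GC_DEBUG idea: time each GC pass via
# Python's gc.callbacks (available since Python 3.3). Drop this into any
# process, including a v0.11.0 worker, to measure GC overhead.
import gc
import time


class GCTimer:
    def __init__(self):
        self.events = []       # list of (generation, elapsed_seconds)
        self._start = None

    def __call__(self, phase, info):
        # CPython invokes callbacks with phase "start" before a collection
        # and "stop" after it; info carries the generation being collected.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.events.append((info["generation"],
                                time.perf_counter() - self._start))
            self._start = None

    def total_gc_seconds(self):
        return sum(elapsed for _, elapsed in self.events)


timer = GCTimer()
gc.callbacks.append(timer)

garbage = [[i] for i in range(10_000)]  # create some collectable objects
del garbage
gc.collect()  # force a full collection so the timer records at least one pass

gc.callbacks.remove(timer)
print(f"observed {len(timer.events)} collection(s), "
      f"total GC time {timer.total_gc_seconds() * 1e3:.2f} ms")
```

Summing `timer.events` over a benchmark run gives you the same kind of "total GC time" number you'd get from VLLM_GC_DEBUG, which is enough to confirm or rule out GC as the bottleneck.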
Based on your GC logs, the total GC time is under 40ms for both v0.11.0 and v0.12.0, which is negligible compared to the overall runtime and cannot explain a 3x throughput difference. This confirms that GC overhead is not the root cause of the performance jump you observed.
Given this, the improvement is more likely due to changes in the scheduler or prefix caching logic, not GC. Several major scheduler and prefix cache optimizations landed between v0.11.0 and v0.12.0, including more efficient handling of shared prefixes and batching, as discussed in PR #24964 and related issues. You should focus your investigation on scheduler and prefix cache batching logic, not GC.
Would you like a step-by-step guide to diff the scheduler and prefix cache code between these versions?
No, GPU Model Runner V2 is not enabled by default in v0.12.0; it remains opt-in. Note that this is distinct from the V1 engine architecture: V1 had already been the default engine for several releases before v0.12.0, and the legacy V0 engine was removed well before it, so the VLLM_USE_V1 variable is not the relevant switch here.
Would you like more detail on how to enable or check the model runner version?