Significant speedup observed with long common prefix between v0.11.0 and v0.12.0

Dear all,

I observed significant performance improvement between v0.11.0 and v0.12.0 and would like to figure out what exactly caused it. Any pointer/hint/suggestions would be greatly appreciated!!!..

Context: we maintain a fork of v0.11.0 and cannot easily migrate to v0.12.0. It’ll be superb if we could somehow transplant the root cause of this optimization onto v0.11.0.

I was doing some benchmarking on Qwen3-4B with L40 GPU. The server launch command is like:

vllm serve /models/Qwen3-4B --served-model-name Qwen3-4B --host 0.0.0.0 --port 1100

And my benchmarking command is:

evalscope perf \
  --parallel 80 \
  --number 800 \
  --model Qwen3-4B \
  --url http://localhost:1100/v1/chat/completions \
  --api openai \
  --dataset random \
  --min-tokens 10 \
  --max-tokens 10 \
  --prefix-length 1195 \
  --min-prompt-length 5 \
  --max-prompt-length 5 \
  --tokenizer-path /models/Qwen3-4B
  • Massive shared prefix (1195/1200)
  • Minimal prefill (5/1200)
  • Minimal generation (10 tokens)

The req/s difference between v0.11.0 and v0.12.0 is significant.
Namely, 45.0566 → 201.632 requests per second.

I did some plotting work by fixing prompt length as 1200 and changing the length of common prefix. The result is shown below:

In long-prefix regime, we can easily observe a 3+ times improvement on throughput. This feels quite unusual, especially considering these two are only one version number away.

I then went on doing some nsight profiling. The difference is also clearly visible:

  • for v0.11.0, lots of bubbles and waiting:

  • for v0.12.0, barely any:

As of now, I don’t have a theory on what is happening. I suspect it has to do with either:

Thanks in advance!!! And don’t hesitate to reach out if any further info is needed.

All the best.

The large performance jump between vLLM v0.11.0 and v0.12.0 for workloads with massive shared prefixes is primarily due to optimizations in prefix cache memory management and garbage collection. Specifically, PR #24964 introduced a change to reuse empty block lists in the KVCacheBlocks structure, significantly reducing Python garbage collection (GC) overhead. This optimization alone led to a 3–4% throughput improvement in decode-heavy workloads and even more in prefill-heavy scenarios, as confirmed by GC profiling and benchmarks. The root cause of the improvement is the reduction in GC time spent managing empty lists, which were a major contributor to performance bottlenecks in prefix caching scenarios with long shared prefixes and high concurrency.

If you want to “transplant” this optimization to your v0.11.0 fork, focus on backporting the changes from PR #24964 and related discussions in issue #24321. These changes make KVCacheBlocks immutable and pre-allocate empty block objects for reuse, thus minimizing GC pressure. No major scheduler changes were cited as the main driver for this specific improvement in the long-prefix regime. Would you like a step-by-step summary of the code changes to port?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM Is there any way I can profile the gc overhead in v0.11.0. I see in v0.12.0 an environment variable VLLM_GC_DEBUG exists. But in v0.11.0 it doesn’t.

It seems likely that v0.11.0 does not have the VLLM_GC_DEBUG environment variable or the built-in GC debug callback found in v0.12.0 and later. The GC profiling feature (VLLM_GC_DEBUG) was introduced after v0.11.0, as described in PR #24829 and the gc_utils documentation. Therefore, you cannot directly enable GC profiling in v0.11.0 using this variable.

If you need GC profiling in v0.11.0, you would need to manually backport the relevant code from v0.12.0 or later, specifically the logic in vllm/utils/gc_utils.py and the callback registration. Would you like a step-by-step guide on how to do this?

Sources:

Hint: Mention @RunLLM in the post for followups.

I cherrypicked the gc utils logic in v0.11.0 and ran gc logging on both. The result:

I genuinely don’t believe this is the cause for the 3x performance. This looks marginal to say the least.

Could this have to do with how the scheduler works?
@RunLLM

Based on your GC logs, the total GC time is under 40ms for both v0.11.0 and v0.12.0, which is negligible compared to the overall runtime and cannot explain a 3x throughput difference. This confirms that GC overhead is not the root cause of the performance jump you observed.

Given this, the improvement is more likely due to changes in the scheduler or prefix caching logic, not GC. Several major scheduler and prefix cache optimizations landed between v0.11.0 and v0.12.0, including more efficient handling of shared prefixes and batching, as discussed in PR #24964 and related issues. You should focus your investigation on scheduler and prefix cache batching logic, not GC.

Would you like a step-by-step guide to diff the scheduler and prefix cache code between these versions?

Sources:

Hint: Mention @RunLLM in the post for followups.

Some more issues for my reference?

Also, show me how to diff them.

@RunLLM

For more issues and discussions related to scheduler and prefix cache batching logic in vLLM, see:

How to diff scheduler and prefix cache code between v0.11.0 and v0.12.0:

  1. Clone the vLLM repo and check out both tags:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    git fetch --all --tags
    
  2. Diff the scheduler and prefix cache files:

    git diff v0.11.0 v0.12.0 -- vllm/core/scheduler.py vllm/core/kv_cache_manager.py
    

    Or, for a broader look:

    git diff v0.11.0 v0.12.0 -- vllm/core/
    
  3. Review the diffs for changes in batching, prefix cache hit logic, and scheduling order.

Would you like a summary of the most important code changes found in these diffs?

Sources:

Hint: Mention @RunLLM in the post for followups.

Is GPU model runner V2 enabled by default in v0.12.0?
@RunLLM

No, GPU Model Runner V2 is not enabled by default in v0.12.0. According to the vLLM V1 User Guide, V1 (which includes the new model runner and scheduler architecture) is enabled by default only in later versions, and you must explicitly enable or disable it using the VLLM_USE_V1 environment variable. In v0.12.0, the default is still the original (V0) engine unless you set this variable.

Would you like more detail on how to enable or check the model runner version?

Sources:

Hint: Mention @RunLLM in the post for followups.