Does beam search support kv cache sharing between beams?

RunLLM · April 19, 2026, 9:18am

Beam search performance in vLLM is highly sensitive to beam width because of how candidates are generated and managed. At each step, vLLM generates 2 × beam_width candidates, so computation and memory usage grow with the beam width. Larger widths also mean more BeamSearchSequence objects and more sorting overhead, which slows inference, especially in recent vLLM versions where beam search is implemented outside the core engine. Enabling prefix caching, which lets sequences with a common prefix reuse the same KV cache blocks, and setting enforce_eager=True can help, but performance still drops significantly at higher beam widths compared to earlier versions or greedy search (source, source, source).
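
To make the candidate growth concrete, here is a minimal pure-Python sketch of one beam search step. It is not vLLM's implementation: `Beam`, `toy_logprobs`, and `beam_step` are hypothetical names, the "model" is a deterministic toy function, and the sketch assumes that 2 × beam_width candidates are taken per live beam before everything is sorted and pruned back to beam_width.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Beam:
    """Hypothetical stand-in for a beam: its tokens and cumulative log-prob."""
    tokens: List[int]
    cum_logprob: float = 0.0


def toy_logprobs(tokens: List[int]) -> Dict[int, float]:
    """Toy 'model' (an assumption, not a real LLM): log-probs over 6 tokens."""
    vocab = 6
    weights = [1.0 + ((sum(tokens) + 3 * t) % vocab) for t in range(vocab)]
    total = sum(weights)
    return {t: math.log(w / total) for t, w in enumerate(weights)}


def beam_step(
    beams: List[Beam],
    beam_width: int,
    logprob_fn: Callable[[List[int]], Dict[int, float]],
) -> Tuple[List[Beam], int]:
    """Expand every beam, then keep the best `beam_width` candidates."""
    candidates: List[Beam] = []
    for beam in beams:
        logprobs = logprob_fn(beam.tokens)
        # Assumed reading of "2 x beam_width": top tokens taken per beam.
        top = heapq.nlargest(2 * beam_width, logprobs.items(), key=lambda kv: kv[1])
        for token, lp in top:
            candidates.append(Beam(beam.tokens + [token], beam.cum_logprob + lp))
    # The sort is over all candidates, so its cost grows with beam_width.
    candidates.sort(key=lambda b: b.cum_logprob, reverse=True)
    return candidates[:beam_width], len(candidates)


# Usage: three steps with beam_width=2, starting from a single prompt beam.
beams = [Beam(tokens=[0])]
for _ in range(3):
    beams, n_candidates = beam_step(beams, beam_width=2, logprob_fn=toy_logprobs)
```

Under these assumptions, a full set of beams produces beam_width × 2 × beam_width candidates per step, so the number of objects created and sorted grows quadratically with the beam width even though only beam_width survive.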

Would you like a detailed breakdown of the technical reasons and possible optimizations?

