Beam search performance in vLLM is highly sensitive to beam size because of how candidates are generated and managed. At each step, vLLM generates 2 × beam_width candidates, so computation and memory usage grow with the beam size. A larger beam also means more BeamSearchSequence objects to create and more candidates to sort, which slows inference. The effect is most visible in recent vLLM versions, where beam search is implemented outside the core engine. Enabling prefix caching and setting enforce_eager=True can help, but throughput still drops significantly at higher beam widths compared to earlier versions or to greedy search (see the sources below).
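To make the scaling concrete, here is a minimal, self-contained sketch of one beam search step. It is not vLLM's code: `Beam`, `toy_logprobs`, and `beam_step` are hypothetical names, the "model" is a deterministic stand-in, and it assumes that each live beam proposes its top 2 × beam_width tokens before the pool is sorted and pruned back to beam_width.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Beam:
    """Hypothetical stand-in for a beam (not vLLM's BeamSearchSequence)."""
    tokens: list
    cum_logprob: float = 0.0


def toy_logprobs(tokens, vocab_size):
    """Fake forward pass: deterministic scores normalized to log-probabilities."""
    seed = sum(tokens) + len(tokens)
    scores = [((seed * 31 + t * 17) % 97) / 97.0 for t in range(vocab_size)]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [s - log_z for s in scores]


def beam_step(beams, beam_width, vocab_size):
    """Expand every beam, then keep the best `beam_width` candidates.

    Assumption: each beam contributes its top 2 * beam_width tokens.
    Returns the surviving beams and the number of candidates created.
    """
    candidates = []
    for beam in beams:
        logprobs = toy_logprobs(beam.tokens, vocab_size)
        top_tokens = heapq.nlargest(
            2 * beam_width, range(vocab_size), key=logprobs.__getitem__
        )
        for tok in top_tokens:
            # One new object per candidate: this is the allocation cost.
            candidates.append(
                Beam(beam.tokens + [tok], beam.cum_logprob + logprobs[tok])
            )
    # Sorting the whole pool is the per-step sorting cost.
    candidates.sort(key=lambda b: b.cum_logprob, reverse=True)
    return candidates[:beam_width], len(candidates)


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        beams = [Beam(tokens=[1])]
        beams, _ = beam_step(beams, width, vocab_size=50)   # fills the beam
        beams, created = beam_step(beams, width, vocab_size=50)
        print(f"beam_width={width}: {created} candidates in one step")
```

Under these assumptions, once the beam is full each step creates and sorts 2 × beam_width² candidate objects, so doubling the beam width roughly quadruples the per-step bookkeeping, in addition to the larger batch the model has to score.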
Would you like a detailed breakdown of the technical reasons and possible optimizations?
Sources:
- https://github.com/vllm-project/vllm/issues/14426
- https://github.com/vllm-project/vllm/pull/8726
- https://docs.vllm.ai/en/latest/api/vllm/entrypoints/llm/#__span-0-700