What are the system-enforced constraints on beam search width in vLLM? Specifically:
- Is there a hard-coded maximum for `beam_width` per request?
- If yes, what is the default cap? (References indicate `max_beam_width=8` by default.)
- How does the `max_num_beams` parameter (default: 16) limit the total beams across concurrent requests?
- Are these limits configurable at runtime? For example (see the invocation sketch after this list):
```python
# Can we override this during initialization?
llm = LLM(model="meta-llama/Llama-3-8B", max_beam_width=16)
```
or via CLI flags:
```bash
python -m vllm.entrypoints.api_server --max-beam-width 16
```
- What errors occur when exceeding these limits? (e.g., `ValueError: Beam width {X} exceeds system limit`)
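
For reference, here is how I invoke beam search at all, as a minimal sketch assuming a recent vLLM (>= 0.6, where beam search is exposed through `LLM.beam_search()` and `BeamSearchParams` rather than `SamplingParams`). The `max_beam_width` engine argument from my example above is exactly what I am asking about, so it is not assumed to exist here:

```python
# Minimal beam-search invocation sketch; assumes vLLM >= 0.6 APIs
# (LLM.beam_search, BeamSearchParams). Nothing here relies on the
# hypothetical max_beam_width / max_num_beams limits being real.
from vllm import LLM
from vllm.inputs import TextPrompt
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="meta-llama/Llama-3-8B")

# beam_width is set per request; the question is what upper bound,
# if any, the engine enforces on this value.
params = BeamSearchParams(beam_width=8, max_tokens=32)
outputs = llm.beam_search([TextPrompt(prompt="The capital of France is")], params)

for output in outputs:
    for seq in output.sequences:  # one entry per surviving beam
        print(seq.cum_logprob, seq.text)
```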
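
And a crude way to surface whatever error (if any) is actually raised past the cap; the exception type and message are part of the question, so the `ValueError` text quoted above is not assumed:

```python
# Empirically probe for a beam-width cap by widening the beam until a
# request is rejected, printing whatever vLLM actually raises. Reuses
# the llm / TextPrompt / BeamSearchParams objects from the sketch above.
for width in (4, 8, 16, 32, 64):
    try:
        probe = BeamSearchParams(beam_width=width, max_tokens=8)
        llm.beam_search([TextPrompt(prompt="Hello")], probe)
        print(f"beam_width={width}: accepted")
    except Exception as exc:
        print(f"beam_width={width}: {type(exc).__name__}: {exc}")
        break
```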