Maximum Beam Width Limitations in vLLM Beam Search

What are the **system-enforced constraints** on beam search width in vLLM? Specifically:

  1. Is there a **hard-coded maximum** for `beam_width` per request? If yes, what is the default cap? (References indicate `max_beam_width=8` by default.)
  2. How does the `max_num_beams` parameter (default: 16) limit the total number of beams across concurrent requests?
  3. Are these limits **configurable at runtime**? For example:

```python
# Can we override this during initialization?
llm = LLM(model="meta-llama/Llama-3-8B", max_beam_width=16)
```

or via CLI flags:

```bash
python -m vllm.entrypoints.api_server --max-beam-width 16
```
  4. What errors occur when exceeding these limits? (e.g., `ValueError: Beam width {X} exceeds system limit`)
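
To make the last question concrete, here is a minimal sketch of the kind of per-request validation being asked about. This is hypothetical illustration only: `MAX_BEAM_WIDTH`, `validate_beam_width`, and the error message mirror the assumptions in this question, not confirmed vLLM source code.

```python
# Hypothetical sketch of a server-side beam-width cap check.
# The cap value (8) and the error text are assumptions taken from the
# question above, not verified against the vLLM codebase.
MAX_BEAM_WIDTH = 8  # assumed default cap


def validate_beam_width(beam_width: int, cap: int = MAX_BEAM_WIDTH) -> int:
    """Reject requests whose beam width exceeds the configured cap."""
    if beam_width > cap:
        raise ValueError(f"Beam width {beam_width} exceeds system limit {cap}")
    return beam_width
```

An answer could point at where (or whether) vLLM performs an equivalent check, and whether the cap is read from an engine argument or hard-coded.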