What are the system-enforced constraints on beam search width in vLLM? Specifically:
- Is there a hard-coded maximum for `beam_width` per request?
- If yes, what is the default cap? (References indicate `max_beam_width=8` by default.)
- How does the `max_num_beams` parameter (default: 16) limit the total beams across concurrent requests?
- Are these limits configurable at runtime? For example (see the invocation sketch after this list):
```python
# Can we override this during initialization?
llm = LLM(model="meta-llama/Llama-3-8B", max_beam_width=16)
```
or via CLI flags:
```bash
python -m vllm.entrypoints.api_server --max-beam-width 16
```
- What errors occur when exceeding these limits? (e.g., `ValueError: Beam width {X} exceeds system limit`)
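
For reference, here is how I invoke beam search at all, as a minimal sketch assuming a recent vLLM (>= 0.6, where beam search is exposed through `LLM.beam_search()` and `BeamSearchParams` rather than `SamplingParams`). The `max_beam_width` engine argument from my example above is exactly what I am asking about, so it is not assumed to exist here:

```python
# Minimal beam-search invocation sketch; assumes vLLM >= 0.6 APIs
# (LLM.beam_search, BeamSearchParams). Nothing here relies on the
# hypothetical max_beam_width / max_num_beams limits being real.
from vllm import LLM
from vllm.inputs import TextPrompt
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="meta-llama/Llama-3-8B")

# beam_width is set per request; the question is what upper bound,
# if any, the engine enforces on this value.
params = BeamSearchParams(beam_width=8, max_tokens=32)
outputs = llm.beam_search([TextPrompt(prompt="The capital of France is")], params)

for output in outputs:
    for seq in output.sequences:  # one entry per surviving beam
        print(seq.cum_logprob, seq.text)
```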
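
And a crude way to surface whatever error (if any) is actually raised past the cap; the exception type and message are part of the question, so the `ValueError` text quoted above is not assumed:

```python
# Empirically probe for a beam-width cap by widening the beam until a
# request is rejected, printing whatever vLLM actually raises. Reuses
# the llm / TextPrompt / BeamSearchParams objects from the sketch above.
for width in (4, 8, 16, 32, 64):
    try:
        probe = BeamSearchParams(beam_width=width, max_tokens=8)
        llm.beam_search([TextPrompt(prompt="Hello")], probe)
        print(f"beam_width={width}: accepted")
    except Exception as exc:
        print(f"beam_width={width}: {type(exc).__name__}: {exc}")
        break
```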