@RunLLM Could you please elaborate on the efficiency comparison between the different structured output backends (outlines, xgrammar, guidance), and also the differences in feature support? Thanks!
Efficiency:
xgrammar and guidance are significantly faster than outlines for structured output generation. Benchmarks show xgrammar achieves much lower first-token latency and higher throughput than outlines, especially for grammar and JSON schema tasks. For example, xgrammar can deliver ~2.12 requests/s and ~1726 tokens/s, while outlines may be an order of magnitude slower in both first-token latency and throughput. guidance (llguidance) is also reported to be faster than outlines and competitive with or faster than xgrammar, especially for JSON schema, due to near-zero compilation time and efficient mask generation. However, on some hardware (e.g., RTX 4090), xgrammar may introduce overhead due to CPU–GPU synchronization, making outlines faster in those cases. See vLLM PR #10785, vLLM PR #12388, and vLLM Issue #18517.
Feature support:
outlines supports a wide range of grammar formats (including lark), regex, and JSON schema, and is the most mature in terms of feature completeness. xgrammar is now the default for many structured output tasks and supports GBNF grammar, JSON schema (including enum, int/number ranges, and regex patterns), and is rapidly closing feature gaps, but some advanced features or non-x86 architectures may still fall back to outlines. guidance (llguidance) supports regex, choice, JSON schema, and grammar, with broad JSON schema support and fast mask generation, but some options (like disable_additional_properties) are only available in guidance. See vLLM structured outputs docs, vLLM config reference, and vLLM Issue #12131.
Would you like a detailed breakdown of specific benchmarks or feature matrices?
Yeah. Also, why have we observed that it can be much slower with structured output in vLLM? Do you have any idea? I generally feel constrained decoding's overhead could be hidden.
Structured output (constrained decoding) in vLLM can be much slower mainly due to CPU–GPU synchronization overhead, inefficient data structure conversions (e.g., Python lists to tensors), and serialization of grammar compilation or FSM (finite state machine) advancement, especially in older or outlines-based backends. Profiling shows that most of the overhead comes from converting allowed token lists to tensors and from blocking operations that cannot be parallelized with GPU computation, which prevents hiding the overhead behind GPU work. This is particularly pronounced with complex grammars or large JSON schemas, and on hardware with lower concurrency (e.g., RTX 4090), where CPU bottlenecks are not masked by GPU parallelism. See vLLM Issue #3567, vLLM Issue #18517, and vLLM PR #10785.
Recent improvements (e.g., xgrammar, guidance, and PRs optimizing cloning and caching) have reduced some of this overhead, but not all sources are eliminated, especially for batch or high-concurrency workloads. True async or parallel logit processing is still a work in progress, and some backends (like outlines) are inherently slower due to their architecture. See vLLM Roadmap Q4 2024 and vLLM Issue #4498.
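To make "hiding the overhead" concrete, here is a minimal, self-contained sketch (not vLLM code; the function names and the toy mask rule are invented for illustration) of how per-step CPU mask construction can be overlapped with the model's forward pass instead of running serially on the critical path:

```python
import concurrent.futures

def advance_fsm_and_build_mask(state: int, vocab_size: int) -> list[bool]:
    """Stand-in for grammar FSM advancement + token-mask construction (CPU work)."""
    # Toy rule: only token ids with the same parity as the state are allowed.
    return [tok % 2 == state % 2 for tok in range(vocab_size)]

def model_forward(step: int, vocab_size: int) -> list[float]:
    """Stand-in for the GPU forward pass producing logits."""
    return [float((step + tok) % 7) for tok in range(vocab_size)]

def decode_step_overlapped(step: int, state: int, vocab_size: int = 16) -> int:
    # Submit mask construction to a worker thread so it overlaps the forward
    # pass, rather than blocking the inference loop before the GPU can start.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        mask_future = pool.submit(advance_fsm_and_build_mask, state, vocab_size)
        logits = model_forward(step, vocab_size)
        mask = mask_future.result()
    # Apply the mask, then greedy-pick among allowed tokens only.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, mask)]
    return max(range(vocab_size), key=lambda t: masked[t])

token = decode_step_overlapped(step=0, state=0)
```

The point of the pattern is that the mask is ready (or nearly ready) by the time the logits arrive; if the CPU work instead ran after the forward pass, its full latency would be added to every decode step.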
Would you like more detail on specific bottlenecks or mitigation strategies?
Yes, please share specific bottlenecks and mitigation strategies.
Key bottlenecks for structured output in vLLM include:
- CPU–GPU Synchronization: Converting allowed token lists (from FSM/grammar) to tensors is slow, especially when done on the CPU and then transferred to the GPU each step. This dominates latency in outlines and early xgrammar implementations (vLLM Issue #3567).
- Grammar Compilation Overhead: Compiling grammars or JSON schemas for each request, rather than caching compiled FSMs, increases first-token latency and reduces throughput (vLLM PR #10785).
- Blocking/Serialized Logit Processing: Logit processors for constrained decoding often run serially and block the main inference loop, preventing overlap with GPU computation (vLLM PR #12388).
- Inefficient Data Structures: Use of Python lists and repeated conversions to tensors, rather than using bitmasks or sparse representations, adds overhead (vLLM Issue #3567).
- Lack of Parallelism: FSM advancement and mask filling are not always parallelized with model decoding, so their cost is not hidden (vLLM PR #10785).
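The data-structure bottleneck above is easy to see in miniature. A hedged sketch (plain Python, no tensors; the helper names are mine, not vLLM's) of the bitmask representation that replaces per-step allowed-token lists:

```python
def allowed_list_to_bitmask(allowed: list[int]) -> int:
    """Pack a list of allowed token ids into a single integer bitmask."""
    mask = 0
    for tok in allowed:
        mask |= 1 << tok
    return mask

def apply_bitmask(logits: list[float], mask: int) -> list[float]:
    """Set logits of disallowed tokens to -inf using the bitmask."""
    return [l if (mask >> tok) & 1 else float("-inf")
            for tok, l in enumerate(logits)]

logits = [0.1, 2.0, 0.3, 1.5]
masked = apply_bitmask(logits, allowed_list_to_bitmask([1, 3]))
```

A bitmask is a fixed-size, densely packed object that can be produced on the CPU and applied to logits in one vectorized operation, unlike a variable-length Python list that must be converted to a tensor and transferred every step.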
Mitigation strategies:
- Use xgrammar or guidance backends: These are much faster than outlines, with xgrammar supporting caching and parallel compilation, and guidance offering near-zero compilation time (vLLM PR #10785, vLLM PR #12388).
- Cache Compiled Grammars: Avoid recompiling FSMs for repeated requests by caching compiled grammars and tokenizer info (vLLM PR #10785).
- Bitmask Representations: Use bitmask tensors for allowed tokens to speed up logit masking (vLLM PR #12388).
- Async/Parallel Logit Processing: Move logit processing and FSM advancement off the critical path, possibly to separate threads or processes (vLLM PR #12388).
- Batching and Overlap: Batch grammar/FSM operations and overlap them with model computation where possible (vLLM PR #10785).
- Hardware Considerations: On lower-concurrency GPUs (e.g., RTX 4090), outlines may be faster due to xgrammar's sync overhead; choose backend accordingly (vLLM Issue #18517).
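The grammar-caching strategy above can be illustrated with a tiny sketch (the compile step is a stand-in, not a real grammar compiler; keying the cache on the schema text mirrors the idea of reusing compiled FSMs across requests):

```python
import functools
import json

@functools.lru_cache(maxsize=128)
def compile_schema(schema_json: str) -> tuple:
    """Stand-in for expensive grammar/FSM compilation, keyed on the schema text."""
    schema = json.loads(schema_json)
    return tuple(sorted(schema.get("required", [])))

# Repeated requests with the same schema hit the cache instead of recompiling.
schema = json.dumps({"type": "object", "required": ["name", "age"]})
first = compile_schema(schema)
second = compile_schema(schema)
hits = compile_schema.cache_info().hits
```

With caching, only the first request with a given schema pays the compilation cost; subsequent requests see near-zero first-token overhead from this source.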
How does vLLM select the decoding backend if it's not set or it's auto?
If the decoding backend is not set or set to auto, vLLM will automatically select the backend based on the request content and the capabilities of the available backend libraries. The selection is "opinionated" and may change between releases, aiming to use the most suitable and performant backend for the specific structured output request (e.g., xgrammar for JSON schema, outlines for certain grammars, etc.) (engine_args docs, structured outputs docs).
You can override this behavior by explicitly setting --guided-decoding-backend to a specific backend (e.g., xgrammar, outlines, guidance, lm-format-enforcer) at engine startup. Per-request backend selection is not supported in vLLM V1 (engine_args docs, structured outputs docs).
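For example, pinning the backend at server startup might look like the following (flag name per the engine_args docs cited above; the model name is just a placeholder):

```shell
# Start the OpenAI-compatible server with a fixed structured output backend
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --guided-decoding-backend xgrammar
```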
Please show me the code where vLLM makes this decision?
The backend selection logic is implemented in the _validate_structured_output method in vLLM's processor code. When the backend is set to "auto", vLLM inspects the request and tries to validate it with xgrammar first; if that fails, it falls back to guidance, and so on. The relevant code is in vllm/v1/engine/processor.py:
def _validate_structured_output(self, params: SamplingParams) -> None:
    if not params.guided_decoding or not self.decoding_config:
        return

    engine_level_backend = self.decoding_config.backend
    if params.guided_decoding.backend:
        if (params.guided_decoding.backend != engine_level_backend
                and not (engine_level_backend == "auto"
                         and params.guided_decoding.backend_was_auto)):
            raise ValueError(
                "Request-level structured output backend selection is no "
                "longer supported. The request specified "
                f"'{params.guided_decoding.backend}', but vLLM was "
                f"initialised with '{engine_level_backend}'. This error "
                "can be resolved by removing backend selection from the "
                "request.")
    else:
        params.guided_decoding.backend = engine_level_backend

    if engine_level_backend.startswith("xgrammar"):
        validate_xgrammar_grammar(params)
    elif engine_level_backend.startswith("guidance"):
        validate_guidance_grammar(params, tokenizer=None)
    elif engine_level_backend == "outlines":
        validate_structured_output_request_outlines(params)
    else:
        try:
            validate_xgrammar_grammar(params)
            params.guided_decoding.backend = "xgrammar"
        except ValueError:
            validate_guidance_grammar(params, tokenizer=None)
            params.guided_decoding.backend = "guidance"
        params.guided_decoding.backend_was_auto = True
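The "auto" branch boils down to a try/fall-back chain. A self-contained sketch of the same pattern, with stub validators standing in for the real validate_* helpers (the lark-rejection rule in the stub is invented purely for illustration):

```python
class UnsupportedGrammar(ValueError):
    pass

def validate_xgrammar(request: dict) -> None:
    """Stub: pretend xgrammar rejects Lark grammars (hypothetical rule)."""
    if request.get("grammar_format") == "lark":
        raise UnsupportedGrammar("xgrammar: lark grammars not supported")

def validate_guidance(request: dict) -> None:
    """Stub: pretend guidance accepts everything we throw at it here."""

def pick_backend_auto(request: dict) -> str:
    # Mirror of the auto path: try xgrammar first, fall back to guidance.
    try:
        validate_xgrammar(request)
        return "xgrammar"
    except ValueError:
        validate_guidance(request)
        return "guidance"

json_backend = pick_backend_auto({"grammar_format": "json_schema"})
lark_backend = pick_backend_auto({"grammar_format": "lark"})
```

This also explains the backend_was_auto flag in the real code: once "auto" has resolved a request to a concrete backend, the flag lets later validation accept that resolved value without treating it as an unsupported request-level override.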