@RunLLM Could you please elaborate on the efficiency comparison between the different structured output backends (outlines, xgrammar, guidance), and also the differences in feature support? Thanks!
Efficiency:
xgrammar and guidance are significantly faster than outlines for structured output generation. Benchmarks show xgrammar achieves much lower first-token latency and higher throughput than outlines, especially for grammar and JSON schema tasks. For example, xgrammar can deliver ~2.12 requests/s and ~1726 tokens/s, while outlines may be an order of magnitude slower in both first-token latency and throughput. guidance (llguidance) is also reported to be faster than outlines and competitive with or faster than xgrammar, especially for JSON schema, due to near-zero compilation time and efficient mask generation. However, on some hardware (e.g., RTX 4090), xgrammar may introduce overhead due to CPU–GPU synchronization, making outlines faster in those cases. See vLLM PR #10785, vLLM PR #12388, and vLLM Issue #18517.
Feature support:
outlines supports a wide range of grammar formats (including lark), regex, and JSON schema, and is the most mature in terms of feature completeness. xgrammar is now the default for many structured output tasks and supports GBNF grammar, JSON schema (including enum, int/number ranges, and regex patterns), and is rapidly closing feature gaps, but some advanced features or non-x86 architectures may still fall back to outlines. guidance (llguidance) supports regex, choice, JSON schema, and grammar, with broad JSON schema support and fast mask generation, but some options (like disable_additional_properties) are only available in guidance. See vLLM structured outputs docs, vLLM config reference, and vLLM Issue #12131.
Would you like a detailed breakdown of specific benchmarks or feature matrices?
Yeah. Also, why have we observed that it can be much slower with structured output in vLLM? Do you have any idea? I generally feel constrained decoding's overhead could be hidden.
Structured output (constrained decoding) in vLLM can be much slower mainly due to CPU–GPU synchronization overhead, inefficient data structure conversions (e.g., Python lists to tensors), and serialization of grammar compilation or FSM (finite state machine) advancement, especially in older or outlines-based backends. Profiling shows that most of the overhead comes from converting allowed token lists to tensors and from blocking operations that cannot be parallelized with GPU computation, which prevents hiding the overhead behind GPU work. This is particularly pronounced with complex grammars or large JSON schemas, and on hardware with lower concurrency (e.g., RTX 4090), where CPU bottlenecks are not masked by GPU parallelism. See vLLM Issue #3567, vLLM Issue #18517, and vLLM PR #10785.
Recent improvements (e.g., xgrammar, guidance, and PRs optimizing cloning and caching) have reduced some of this overhead, but not all sources are eliminated, especially for batch or high-concurrency workloads. True async or parallel logit processing is still a work in progress, and some backends (like outlines) are inherently slower due to their architecture. See vLLM Roadmap Q4 2024 and vLLM Issue #4498.
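To make "hiding the overhead" concrete, here is a minimal, self-contained sketch (not vLLM code; the function names and the toy mask rule are invented for illustration) of how per-step CPU mask construction can be overlapped with the model's forward pass instead of running serially on the critical path:

```python
import concurrent.futures

def advance_fsm_and_build_mask(state: int, vocab_size: int) -> list[bool]:
    """Stand-in for grammar FSM advancement + token-mask construction (CPU work)."""
    # Toy rule: only token ids with the same parity as the state are allowed.
    return [tok % 2 == state % 2 for tok in range(vocab_size)]

def model_forward(step: int, vocab_size: int) -> list[float]:
    """Stand-in for the GPU forward pass producing logits."""
    return [float((step + tok) % 7) for tok in range(vocab_size)]

def decode_step_overlapped(step: int, state: int, vocab_size: int = 16) -> int:
    # Submit mask construction to a worker thread so it overlaps the forward
    # pass, rather than blocking the inference loop before the GPU can start.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        mask_future = pool.submit(advance_fsm_and_build_mask, state, vocab_size)
        logits = model_forward(step, vocab_size)
        mask = mask_future.result()
    # Apply the mask, then greedy-pick among allowed tokens only.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, mask)]
    return max(range(vocab_size), key=lambda t: masked[t])

token = decode_step_overlapped(step=0, state=0)
```

The point of the pattern is that the mask is ready (or nearly ready) by the time the logits arrive; if the CPU work instead ran after the forward pass, its full latency would be added to every decode step.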
Would you like more detail on specific bottlenecks or mitigation strategies?
Yes, please share specific bottlenecks and mitigation strategies.
Key bottlenecks for structured output in vLLM include:
- CPU–GPU Synchronization: Converting allowed token lists (from FSM/grammar) to tensors is slow, especially when done on the CPU and then transferred to the GPU each step. This dominates latency in outlines and early xgrammar implementations (vLLM Issue #3567).
- Grammar Compilation Overhead: Compiling grammars or JSON schemas for each request, rather than caching compiled FSMs, increases first-token latency and reduces throughput (vLLM PR #10785).
- Blocking/Serialized Logit Processing: Logit processors for constrained decoding often run serially and block the main inference loop, preventing overlap with GPU computation (vLLM PR #12388).
- Inefficient Data Structures: Use of Python lists and repeated conversions to tensors, rather than using bitmasks or sparse representations, adds overhead (vLLM Issue #3567).
- Lack of Parallelism: FSM advancement and mask filling are not always parallelized with model decoding, so their cost is not hidden (vLLM PR #10785).
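The data-structure bottleneck above is easy to see in miniature. A hedged sketch (plain Python, no tensors; the helper names are mine, not vLLM's) of the bitmask representation that replaces per-step allowed-token lists:

```python
def allowed_list_to_bitmask(allowed: list[int]) -> int:
    """Pack a list of allowed token ids into a single integer bitmask."""
    mask = 0
    for tok in allowed:
        mask |= 1 << tok
    return mask

def apply_bitmask(logits: list[float], mask: int) -> list[float]:
    """Set logits of disallowed tokens to -inf using the bitmask."""
    return [l if (mask >> tok) & 1 else float("-inf")
            for tok, l in enumerate(logits)]

logits = [0.1, 2.0, 0.3, 1.5]
masked = apply_bitmask(logits, allowed_list_to_bitmask([1, 3]))
```

A bitmask is a fixed-size, densely packed object that can be produced on the CPU and applied to logits in one vectorized operation, unlike a variable-length Python list that must be converted to a tensor and transferred every step.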
Mitigation strategies:
- Use xgrammar or guidance backends: These are much faster than outlines, with xgrammar supporting caching and parallel compilation, and guidance offering near-zero compilation time (vLLM PR #10785, vLLM PR #12388).
- Cache Compiled Grammars: Avoid recompiling FSMs for repeated requests by caching compiled grammars and tokenizer info (vLLM PR #10785).
- Bitmask Representations: Use bitmask tensors for allowed tokens to speed up logit masking (vLLM PR #12388).
- Async/Parallel Logit Processing: Move logit processing and FSM advancement off the critical path, possibly to separate threads or processes (vLLM PR #12388).
- Batching and Overlap: Batch grammar/FSM operations and overlap them with model computation where possible (vLLM PR #10785).
- Hardware Considerations: On lower-concurrency GPUs (e.g., RTX 4090), outlines may be faster due to xgrammar's sync overhead; choose backend accordingly (vLLM Issue #18517).
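The grammar-caching strategy above can be illustrated with a tiny sketch (the compile step is a stand-in, not a real grammar compiler; keying the cache on the schema text mirrors the idea of reusing compiled FSMs across requests):

```python
import functools
import json

@functools.lru_cache(maxsize=128)
def compile_schema(schema_json: str) -> tuple:
    """Stand-in for expensive grammar/FSM compilation, keyed on the schema text."""
    schema = json.loads(schema_json)
    return tuple(sorted(schema.get("required", [])))

# Repeated requests with the same schema hit the cache instead of recompiling.
schema = json.dumps({"type": "object", "required": ["name", "age"]})
first = compile_schema(schema)
second = compile_schema(schema)
hits = compile_schema.cache_info().hits
```

With caching, only the first request with a given schema pays the compilation cost; subsequent requests see near-zero first-token overhead from this source.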
How does vLLM select the decoding backend if it's not set or it's auto?
If the decoding backend is not set or set to auto, vLLM will automatically select the backend based on the request content and the capabilities of the available backend libraries. The selection is "opinionated" and may change between releases, aiming to use the most suitable and performant backend for the specific structured output request (e.g., xgrammar for JSON schema, outlines for certain grammars, etc.) (engine_args docs, structured outputs docs).
You can override this behavior by explicitly setting --guided-decoding-backend to a specific backend (e.g., xgrammar, outlines, guidance, lm-format-enforcer) at engine startup. Per-request backend selection is not supported in vLLM V1 (engine_args docs, structured outputs docs).
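For example, pinning the backend at server startup might look like the following (flag name per the engine_args docs cited above; the model name is just a placeholder):

```shell
# Start the OpenAI-compatible server with a fixed structured output backend
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --guided-decoding-backend xgrammar
```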
Please show me the code where vLLM makes this decision?
The backend selection logic is implemented in the _validate_structured_output method in vLLM's processor code. When the backend is set to "auto", vLLM inspects the request and tries to validate it with xgrammar first; if that fails, it falls back to guidance, and so on. The relevant code is in vllm/v1/engine/processor.py:
def _validate_structured_output(self, params: SamplingParams) -> None:
    if not params.guided_decoding or not self.decoding_config:
        return

    engine_level_backend = self.decoding_config.backend
    if params.guided_decoding.backend:
        if (params.guided_decoding.backend != engine_level_backend
                and not (engine_level_backend == "auto"
                         and params.guided_decoding.backend_was_auto)):
            raise ValueError(
                "Request-level structured output backend selection is no "
                "longer supported. The request specified "
                f"'{params.guided_decoding.backend}', but vLLM was "
                f"initialised with '{engine_level_backend}'. This error "
                "can be resolved by removing backend selection from the "
                "request.")
    else:
        params.guided_decoding.backend = engine_level_backend

    if engine_level_backend.startswith("xgrammar"):
        validate_xgrammar_grammar(params)
    elif engine_level_backend.startswith("guidance"):
        validate_guidance_grammar(params, tokenizer=None)
    elif engine_level_backend == "outlines":
        validate_structured_output_request_outlines(params)
    else:
        try:
            validate_xgrammar_grammar(params)
            params.guided_decoding.backend = "xgrammar"
        except ValueError:
            validate_guidance_grammar(params, tokenizer=None)
            params.guided_decoding.backend = "guidance"
        params.guided_decoding.backend_was_auto = True
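The "auto" branch boils down to a try/fall-back chain. A self-contained sketch of the same pattern, with stub validators standing in for the real validate_* helpers (the lark-rejection rule in the stub is invented purely for illustration):

```python
class UnsupportedGrammar(ValueError):
    pass

def validate_xgrammar(request: dict) -> None:
    """Stub: pretend xgrammar rejects Lark grammars (hypothetical rule)."""
    if request.get("grammar_format") == "lark":
        raise UnsupportedGrammar("xgrammar: lark grammars not supported")

def validate_guidance(request: dict) -> None:
    """Stub: pretend guidance accepts everything we throw at it here."""

def pick_backend_auto(request: dict) -> str:
    # Mirror of the auto path: try xgrammar first, fall back to guidance.
    try:
        validate_xgrammar(request)
        return "xgrammar"
    except ValueError:
        validate_guidance(request)
        return "guidance"

json_backend = pick_backend_auto({"grammar_format": "json_schema"})
lark_backend = pick_backend_auto({"grammar_format": "lark"})
```

This also explains the backend_was_auto flag in the real code: once "auto" has resolved a request to a concrete backend, the flag lets later validation accept that resolved value without treating it as an unsupported request-level override.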