How to get structured outputs in vllm?

I wander how to get structured outputs in vllm?
I’ve searched, the openai’s official doc says there exist structured outputs and json mode, and different model have different support
I use qwen3 and how can I get structured outputs?

To get structured outputs in vLLM with Qwen3, use the OpenAI-compatible API and pass the desired constraint (e.g., JSON schema, regex, choice, or grammar) via the extra_body parameter in your request. For JSON, use extra_body={“structured_outputs”: {“json”: your_json_schema}}; for choice, use extra_body={“structured_outputs”: {“choice”: [“option1”, “option2”]}}, etc. Qwen3 models support these features by default in vLLM’s OpenAI server. See the official docs for examples and Qwen3-specific usage.

Example for JSON output:

from openai import OpenAI
from pydantic import BaseModel

class Car(BaseModel):
    brand: str
    model: str

json_schema = Car.model_json_schema()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
completion = client.chat.completions.create(
    model="Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": "Generate a JSON with the brand and model of a famous car."}],
    extra_body={"structured_outputs": {"json": json_schema}},
)
print(completion.choices[0].message.content)

Would you like more detail or examples for other constraint types?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM
but I find that openai’s doc says they use code like:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {
            "role": "user",
            "content": "Alice and Bob are going to a science fair on Friday.",
        },
    ],
    text_format=CalendarEvent,
)

event = response.output_parsed

does qwen3 support that? or which models support that?

The OpenAI client.responses.parse method with text_format=CalendarEvent is supported in vLLM for models that implement structured output, but Qwen3 support for this exact API is not explicitly documented in vLLM’s official sources. The vLLM test suite shows this works for models that support structured outputs and JSON schema, but model-specific support (like Qwen3) may vary and is not guaranteed for all models or all API variants—most examples use client.chat.completions.create with structured_outputs or response_format parameters instead. See test_structured_output_with_parse and structured outputs docs.

Would you like a step-by-step example for Qwen3 using the recommended API?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM
I’ve start the server, it has the interface like:

GET:
Health
Get Server Load Metrics
Ping
Show Available Models
Show Version
Retrieve Responses
Metrics

POST:
Ping
Tokenize
Detokenize
Create Responses
Cancel Responses
Create Chat Completion
Create Completion
Create Embedding
Create Pooling
Create Classify
Create Score
Create Score V1
Create Transcriptions
Create Translations
Do Rerank
Do Rerank V1
Do Rerank V2
Scale Elastic Ep
Is Scaling Elastic Ep
Invocations

can I use structured outputs?

Yes, you can use structured outputs via the Create Chat Completion and Create Completion POST endpoints in vLLM’s OpenAI-compatible server. Pass your structured output constraints (e.g., JSON schema, regex, choice, grammar) using the extra_body parameter as described in the structured outputs documentation. This is supported by default for compatible models.

Would you like a code example for your use case?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM
I mean, can I use client.response.parse and text_format to get structured outputs?

The client.responses.parse method with the text_format parameter is supported in vLLM for models that implement structured outputs, as shown in the vLLM test suite. However, official documentation does not guarantee this works for all models, and Qwen3 support for this exact API is not explicitly confirmed—most vLLM examples use client.chat.completions.create with structured_outputs or response_format instead. See test_structured_output_with_parse.

Would you like a step-by-step example or guidance for your specific model?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM
I’ve tried, the two kind of method are both available, json schema use interface POST /v1/responses, and client.response.parse use interface POST /v1/chat/completions
but the client.response.parse did not work well

You are correct: in vLLM, structured outputs using JSON schema are officially supported via the POST /v1/responses endpoint, and this is the recommended way. The client.responses.parse method (which typically uses POST /v1/chat/completions) may not work reliably for all models or server configurations, and this limitation is not unique to Qwen3—it’s a known gap in vLLM’s OpenAI compatibility layer, as seen in test_structured_output_with_parse and related issues.

For best results, use the /v1/responses endpoint with structured outputs for JSON schema. Would you like more detail on how to structure your request for this endpoint?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM
no, sorry, I said it the other way around
I mean, if I use code like:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {
            "role": "user",
            "content": "Alice and Bob are going to a science fair on Friday.",
        },
    ],
    text_format=CalendarEvent,
)

event = response.output_parsed

it will use interface POST /v1/responses
and if I use code like:

from openai import OpenAI
from pydantic import BaseModel

class Car(BaseModel):
    brand: str
    model: str

json_schema = Car.model_json_schema()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
completion = client.chat.completions.create(
    model="Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": "Generate a JSON with the brand and model of a famous car."}],
    extra_body={"structured_outputs": {"json": json_schema}},
)
print(completion.choices[0].message.content)

that will use interface POST /v1/chat/completions


the former worked not very well

You are correct: in vLLM, structured outputs using JSON schema are best supported via the POST /v1/chat/completions endpoint (e.g., with client.chat.completions.create and structured_outputs/extra_body). The POST /v1/responses endpoint (used by client.responses.parse) may not work as reliably for structured outputs, especially with some models like Qwen3, and this is reflected in vLLM’s own tests and user reports. The recommended approach is to use the chat completions endpoint for structured output tasks in vLLM (docs reference).

Would you like more detail on how to adapt your code for best results with vLLM?

Sources:

Hint: Mention @RunLLM in the post for followups.

I guess a better way is use this:

class Car(BaseModel):
    brand: str = Field(description="Brand name, usually same as the company name")
    model: str = Field(description="Model Name")

class Cars(BaseModel):
    car_list: list[Car]

json_schema = Car.model_json_schema()
completion = client.chat.completions.parse(
    model="qwen3-32b-bnb-4bit",
    messages=[{"role": "user", "content": "list 10 kinds of famous cars"}],
    # extra_body={"structured_outputs": {"json": Cars.model_json_schema()}},
    response_format=Cars,
)
print(completion.choices[0].message.content)