@RunLLM are there any improvements on the support for gpt-oss models since 0.11.0 until now? for example, whether it can support response_format now
Yes, GPT-OSS support has improved since vLLM 0.11.0, including better handling of tool calling, the Harmony format, and structured outputs. However, there are still known issues with `response_format: {"type": "json_object"}` in multi-turn chat completions for GPT-OSS models: this mode can return `content: null` or invalid JSON, making it unsafe for production in multi-turn scenarios. Single-turn requests and non-JSON modes work as intended, and some users report that downgrading to vLLM 0.10.1 avoids the multi-turn JSON bug. The `/v1/responses` endpoint is recommended for tool use and structured output, as it leverages the Harmony format and is more robust for GPT-OSS models. Ongoing work and recent PRs continue to address these issues, but as of vLLM 0.14.0 the multi-turn JSON bug remains unresolved for chat completions with GPT-OSS models, according to user reports and open issues.
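Given the `content: null` / invalid-JSON failure mode described above, a defensive check on the client side is a reasonable stopgap. This is a minimal sketch (the helper name `safe_json_content` is illustrative, not part of vLLM or the OpenAI client):

```python
import json


def safe_json_content(message_content):
    """Parse the assistant message from a json_object-mode chat completion.

    Returns the parsed JSON object, or None when the model returned
    null or invalid JSON (the known GPT-OSS multi-turn bug), so callers
    can retry or fall back instead of crashing downstream.
    """
    if message_content is None:
        return None
    try:
        parsed = json.loads(message_content)
    except (json.JSONDecodeError, TypeError):
        return None
    # json_object mode promises an object, so reject bare arrays/scalars.
    return parsed if isinstance(parsed, dict) else None
```

Calling this on `response.choices[0].message.content` before using the result lets multi-turn code degrade gracefully when the bug triggers.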
Would you like more detail on specific improvements or workarounds?
Hint: Mention @RunLLM in the post for followups.
@RunLLM give me details on what is improved?
Key improvements for GPT-OSS support in vLLM since 0.11.0 include:
- Harmony Format & Tool Calling: Major bugfixes and enhancements were made to Harmony format parsing and tool calling, especially for multi-turn and streaming scenarios. The Chat Completion API and tool call parser were improved to better support GPT-OSS's built-in tools and reasoning channels. Streaming tool call handling was also fixed to activate the correct Harmony channels, reducing tool call extraction errors (PR #24768).
- Structured Outputs & Response Format: Structured output support (e.g., JSON, regex, grammar) was expanded and stabilized, including validation and backend improvements (structured outputs docs). However, as noted, multi-turn chat completions with `response_format: {"type": "json_object"}` still have unresolved bugs; single-turn requests and the `/v1/responses` endpoint are more robust (issue #32791, structured outputs example).
- MCP Tool Integration: vLLM now supports integration with arbitrary MCP servers, allowing GPT-OSS to call custom tools beyond the built-in ones (issue #26703).
- General Stability & Parsing: Numerous bugfixes addressed Harmony parsing errors, tool call leakage, and improved multi-turn conversation handling (PR #28729, issue #23567).
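As a hedged sketch of the two request routes contrasted above: the field names follow the OpenAI-compatible API that vLLM serves, but the helper names and the model id `openai/gpt-oss-20b` are illustrative assumptions, not vLLM APIs.

```python
def chat_json_request(model: str, user_msg: str) -> dict:
    """Chat Completions body using json_object mode.

    Single-turn is the safe case for GPT-OSS; multi-turn with this
    response_format hits the known null/invalid-JSON bug.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "response_format": {"type": "json_object"},
    }


def responses_api_request(model: str, user_msg: str) -> dict:
    """Body for the /v1/responses route, recommended for GPT-OSS
    because it speaks the Harmony format natively."""
    return {"model": model, "input": user_msg}
```

Either dict can be POSTed to the corresponding endpoint of a running vLLM server (e.g. with `openai.OpenAI(base_url="http://localhost:8000/v1")` or plain `requests`).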
Let me know if you want step-by-step details or code examples for any of these improvements.
Sources:
- https://github.com/vllm-project/vllm/pull/22342
- https://github.com/vllm-project/vllm/pull/24768
- https://github.com/vllm-project/vllm/pull/28729
- Bug: chat.completions returns content: null for GPT-OSS multi-turn with json_object
- Structured Outputs - vLLM Docs
- https://github.com/vllm-project/vllm/issues/26703
- Bug: openai_harmony.HarmonyError: unexpected tokens remaining in message header