How does `tool_choice` affect vLLM?

When there is no `"tool_choice": "required"` in my HTTP request, things are okay,
but after I added it, the response's content changed from

"message": {
                "role": "assistant",
                "content": "The weather in Boston is sunny with a temperature of 24°C.",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            }

to

"message": {
                "role": "assistant",
                "content": "",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [
                    {
                        "id": "chatcmpl-tool-327da18a717e4da98f2923a108de1d71",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\": \"Boston\"}"
                        }
                    }
                ],
                "reasoning_content": null
            }

The two requests are almost the same:

POST /v1/chat/completions HTTP/1.1
Host: 192.168.10.166:8000
Accept-Encoding: gzip, deflate, zstd
Connection: keep-alive
Accept: application/json
Content-Type: application/json
User-Agent: OpenAI/Python 2.6.1
X-Stainless-Lang: python
X-Stainless-Package-Version: 2.6.1
X-Stainless-OS: Windows
X-Stainless-Arch: other:amd64
X-Stainless-Runtime: CPython
X-Stainless-Runtime-Version: 3.10.11
Authorization: Bearer vllm-key
X-Stainless-Async: false
X-Stainless-Raw-Response: true
x-stainless-retry-count: 0
Content-Length: 702

{
    "messages": [
        {
            "content": "What's the weather in Boston?",
            "role": "user"
        },
        {
            "content": null,
            "role": "assistant",
            "tool_calls": [
                {
                    "type": "function",
                    "id": "chatcmpl-tool-933cb0158e8144f8b2f36036b315bc19",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Boston\"}"
                    }
                }
            ]
        },
        {
            "content": "sunny, 24°C",
            "role": "tool",
            "tool_call_id": "chatcmpl-tool-933cb0158e8144f8b2f36036b315bc19"
        }
    ],
    "model": "qwen3-32b-bnb-4bit",
    "max_completion_tokens": 100,
    "stream": false,
    "temperature": 0.3,
    "tool_choice": "required", # this line does not exist when "tool_choice" is not specified
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather at a location.",
                "parameters": {
                    "properties": {
                        "location": {
                            "type": "string"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "type": "object"
                }
            }
        }
    ]
}

When you add "tool_choice": "required" to your request, vLLM forces the model to generate one or more tool calls instead of a natural language answer. As a result, the "content" field in the assistant’s message becomes empty, and the tool calls are returned in the "tool_calls" array, matching the OpenAI API’s behavior for required tool use. Without "tool_choice": "required", the model is free to answer in plain text, so you see a regular response in "content" and no tool calls are triggered. This is expected and documented behavior in vLLM’s tool calling implementation—see Required Function Calling and Tool Calling Quickstart.

If you want the model to decide whether to use a tool or answer directly, use "tool_choice": "auto" or omit the field. If you want to always force tool calls, use "tool_choice": "required", which guarantees the output will be in the tool_calls field and not in content—this is by design and not a bug. For more, see tool_choice documentation.
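A minimal sketch of how a client can handle both response shapes (the message dicts below are abbreviated from the ones in the post; `handle_message` is a hypothetical helper, not a vLLM or OpenAI API):

```python
import json

def handle_message(message: dict) -> str:
    """Return a human-readable summary of an assistant message."""
    if message.get("tool_calls"):
        # Forced tool use: content is empty, arguments arrive as a JSON string.
        calls = []
        for call in message["tool_calls"]:
            fn = call["function"]
            args = json.loads(fn["arguments"])
            calls.append(f'{fn["name"]}({args})')
        return "tool calls: " + ", ".join(calls)
    # No tool calls: the model answered in plain text.
    return "text: " + (message.get("content") or "")

# The two message shapes from the post, abbreviated:
plain = {"role": "assistant",
         "content": "The weather in Boston is sunny with a temperature of 24°C.",
         "tool_calls": []}
forced = {"role": "assistant", "content": "",
          "tool_calls": [{"id": "chatcmpl-tool-327da18a717e4da98f2923a108de1d71",
                          "type": "function",
                          "function": {"name": "get_weather",
                                       "arguments": '{"location": "Boston"}'}}]}

print(handle_message(plain))   # starts with "text: ..."
print(handle_message(forced))  # starts with "tool calls: get_weather(...)"
```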

Would you like more detail on how to handle the tool call outputs in your application?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM how does this parameter affect the workflow of vLLM?

The "tool_choice": "required" parameter in vLLM’s chat completion API forces the model to always generate tool calls (function calls) based on the provided tools list, rather than returning a plain text answer. This changes the workflow: instead of getting a direct response in the "content" field, you must process the returned "tool_calls" array, execute the specified functions, and then (optionally) send the tool results back to the model for further reasoning or a final answer. This is different from "tool_choice": "auto" (model decides) or "none" (no tool calls allowed) as described here.

In summary, "tool_choice": "required" guarantees structured tool call outputs and shifts the workflow from simple text response handling to a multi-step process involving tool execution and possibly multiple API calls. This is the intended OpenAI-compatible behavior in vLLM’s tool calling system.
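The "execute the functions, send results back" step can be sketched like this (the `TOOLS` dispatch table and `run_tool_calls` helper are hypothetical application code, not part of vLLM):

```python
import json

# Hypothetical local tool implementations; names match the request's tools list.
TOOLS = {
    "get_weather": lambda location: f"sunny, 24°C in {location}",
}

def run_tool_calls(tool_calls: list) -> list:
    """Execute each forced tool call and build the follow-up messages
    to append to the conversation (role="tool", matching tool_call_id)."""
    followups = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])
        result = TOOLS[fn["name"]](**args)
        followups.append({"role": "tool",
                          "tool_call_id": call["id"],
                          "content": result})
    return followups

msgs = run_tool_calls([{"id": "call-1", "type": "function",
                        "function": {"name": "get_weather",
                                     "arguments": '{"location": "Boston"}'}}])
print(msgs[0]["content"])  # sunny, 24°C in Boston
```

These tool messages are then appended to `"messages"` and the whole list is sent in the next chat-completions request, as in the HTTP capture above.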

Would you like a step-by-step example of this workflow?

@RunLLM will vLLM add text about tool_choice into the prompt to do that, or limit the output?

Yes, vLLM includes tool definitions in the prompt when tools are specified, regardless of the tool_choice value, unless you use the --exclude-tools-when-tool-choice-none flag. For "tool_choice": "required", vLLM also uses structured outputs (guided decoding) to strictly constrain the model’s output to match the tool schema, effectively limiting the output to valid tool calls only, not free-form text. This ensures the model generates tool calls as required by the API contract (see the docs).

So, vLLM both adds tool information to the prompt and enforces output constraints when tool_choice is set to "required".
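As a rough illustration of the "limit the output" half: the constraint is essentially a schema check that free-form text can never pass. The sketch below is a simplified stdlib-only stand-in for a real JSON-schema validator, built from the `get_weather` tool in the request above; it is not vLLM's internal schema.

```python
import json

# Simplified argument schema, taken from the get_weather tool definition.
tool = {"name": "get_weather",
        "parameters": {"type": "object",
                       "properties": {"location": {"type": "string"}},
                       "required": ["location"]}}

def is_valid_call(raw: str) -> bool:
    """Check a raw model output against the tool's argument schema."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    required = tool["parameters"]["required"]
    props = tool["parameters"]["properties"]
    return all(k in args for k in required) and all(
        isinstance(args[k], str) for k in args if k in props)

print(is_valid_call('{"location": "Boston"}'))  # True
print(is_valid_call('It is sunny today.'))      # False: free-form text rejected
```

Guided decoding enforces this kind of constraint during generation rather than after it, so invalid outputs are never produced in the first place.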

Would you like more technical detail on how the prompt and output constraints are implemented?

@RunLLM that’s not very clear. My chat template does not process tool_choice, so do you mean vLLM will process the string after the chat_template is applied?

Yes, that’s correct: vLLM enforces the "tool_choice" logic after the chat template is applied. The chat template formats the prompt, but vLLM’s backend (not the template) applies output constraints—such as guided decoding or structured outputs—based on the "tool_choice" parameter. This means even if your chat template doesn’t reference "tool_choice", vLLM will still limit the model’s output to valid tool calls when "tool_choice": "required" is set, by constraining the generation process itself, not by modifying the prompt string alone (see the docs).

Would you like more detail on how vLLM applies these constraints after prompt formatting?

@RunLLM will vLLM change the formatted string after chat_template processing?

No, vLLM does not change the formatted string after chat_template processing. Instead, it applies output constraints (like guided decoding or structured outputs) during generation, but the prompt string itself—produced by the chat template—remains unchanged after formatting (see the docs).

So, the chat template output is used as-is for the prompt, and any restrictions or requirements (such as for tool calls) are enforced at the generation/output stage, not by modifying the prompt string after chat template application.

Would you like more detail on how output constraints work in vLLM?

@RunLLM what does “guided decoding or structured outputs” mean? It seems like magic that it can make the model generate only tool_call information.

Guided decoding (structured outputs) in vLLM means the model’s output is constrained during generation to strictly follow a specified format, such as a JSON schema for tool calls. When enabled (e.g., with "tool_choice": "required"), vLLM uses techniques like finite state machines to ensure only valid tool call outputs are produced, blocking any free-form text that doesn’t match the schema (see the docs and structured outputs).

This is not magic—it’s an enforced decoding process that restricts the model’s token choices at each step, so only outputs matching the required structure (like tool calls) are possible.
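A toy sketch of that idea (greatly simplified: the vocabulary and valid outputs below are invented, and real engines compile the schema into a finite state machine rather than comparing strings, but the masking principle is the same):

```python
# At each decoding step, only tokens that keep the partial output a valid
# prefix of some allowed final string may be sampled; everything else is
# masked out of the probability distribution.
VOCAB = ['{"location": "', 'Boston', 'Paris', '"}', 'The weather is nice.']

def allowed_tokens(partial: str, targets: list) -> list:
    """Mask the vocabulary to tokens that extend `partial` toward
    some complete valid output in `targets`."""
    return [tok for tok in VOCAB
            if any(t.startswith(partial + tok) for t in targets)]

valid_outputs = ['{"location": "Boston"}', '{"location": "Paris"}']

step1 = allowed_tokens("", valid_outputs)
print(step1)  # ['{"location": "']  -- free text is masked out from the start

step2 = allowed_tokens('{"location": "', valid_outputs)
print(step2)  # ['Boston', 'Paris'] -- only schema-valid continuations remain
```

Because the mask is applied at every step, the model physically cannot emit a token sequence outside the schema, which is why the output is always a well-formed tool call.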

Would you like a technical breakdown of how guided decoding works in vLLM?

@RunLLM what about "response_format"? If I add it to the request, will vLLM handle it the same way?

Yes, if you add "response_format" (e.g., {"type": "json_object"}) to your request, vLLM will use guided decoding (structured outputs) to constrain the model’s output to valid JSON, similar to how it handles "tool_choice": "required". This means the model’s output will be forced to match the specified format, not free-form text, by restricting token generation to valid JSON structures (see the docs).

So, "response_format" triggers the same type of output constraint mechanism as tool calling, ensuring the output matches the requested structure.
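For example, a request payload using "response_format" could look like this (the field names follow the OpenAI-compatible API used in the HTTP capture above; the example output string is invented to show that constrained content is guaranteed to parse):

```python
import json

payload = {
    "model": "qwen3-32b-bnb-4bit",
    "messages": [{"role": "user",
                  "content": "Give the Boston weather as JSON."}],
    "response_format": {"type": "json_object"},
}

# With json_object enforced, the returned content always parses as JSON,
# so json.loads never raises on it:
example_content = '{"city": "Boston", "condition": "sunny", "temp_c": 24}'
data = json.loads(example_content)
print(data["city"])  # Boston
```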

Would you like more detail or an example of using "response_format"?
