@RunLLM I want to know how to set the thinking token limit when serving the model using a docker compose file.
Below is the docker compose file that I am using to serve the thinking model:
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
There is currently no built-in vLLM command-line argument or documented API to directly set a “thinking token limit” (e.g., max_thinking_tokens) for reasoning models like Qwen3 in the docker compose or server command. Limiting the number of thinking tokens is a requested feature and can be implemented at the application level using a custom logits processor, but vLLM v1 does not natively support this as a server flag or config parameter yet. See discussion and workarounds in this issue.
If you need to enforce a thinking token limit, you must implement a custom logits processor in your client or application code, not via the docker compose or vLLM server command. Would you like more detail on how to implement this workaround?
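As a rough illustration of that workaround, the sketch below is a framework-agnostic, plain-Python version of such a processor: once a thinking-token budget is spent, it masks every logit except the end-of-thinking token so the model is forced to close its reasoning. The token id and budget are made-up values, and a real integration would have to plug an equivalent callable into the actual inference pipeline.

```python
# Hypothetical thinking-token budget enforcer, written in the classic
# (generated_token_ids, logits) -> logits shape that logits processors use.
# THINK_END_ID and MAX_THINKING_TOKENS are illustrative placeholders.

THINK_END_ID = 7          # hypothetical vocab id of the "</think>" token
MAX_THINKING_TOKENS = 5   # hypothetical budget for the reasoning section

def limit_thinking(generated_ids, logits):
    """Force the end-of-thinking token once the budget is exhausted."""
    budget_spent = len(generated_ids) >= MAX_THINKING_TOKENS
    still_thinking = THINK_END_ID not in generated_ids
    if budget_spent and still_thinking:
        # Mask everything except </think>, so it is the only possible choice.
        return [0.0 if i == THINK_END_ID else float("-inf")
                for i in range(len(logits))]
    return logits  # within budget (or already done thinking): no change
```

Within the budget the logits pass through untouched; at or past the budget, only the end-of-thinking token remains selectable.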
No, there is no built-in way to control the thinking token limit from the vLLM server side or via OpenAI-compatible API parameters when using external agent frameworks like Google ADK. The only available methods are custom application-level logic or modifying the model prompt to encourage shorter reasoning, but these are soft controls, not hard token limits. See this discussion.
Would you like more detail on possible prompt engineering or client-side workarounds?
A client-side workaround involves monitoring the model’s output for the start of the “thinking” section (e.g., <think>) and stopping generation or sending a stop sequence once your desired token limit is reached. This can be done by making two API calls: first, generate up to the thinking token limit, then prompt the model to end the reasoning and produce the final answer. This approach is described in this issue.
Alternatively, you can implement a custom logits processor in your client code to force the end-of-thinking token after a set number of tokens, but this requires direct integration with the model inference pipeline and is not supported via the OpenAI API interface in vLLM v1. Would you like a step-by-step example of either approach?
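To make the two-call approach above concrete, here is a small pure-Python helper (the names are my own, not a vLLM API): call 1 is an ordinary chat-completions request with max_tokens set to your thinking budget, which yields a truncated reasoning trace; this helper then builds the messages for call 2, closing the think block ourselves so the model continues straight into the final answer. The think-tag convention follows Qwen3; the actual HTTP calls are only indicated in comments since they need a running server.

```python
# Sketch of call 2 in the two-call workaround: feed back the truncated
# reasoning as a pre-filled assistant turn with an explicit </think>.

def build_followup_messages(user_msg, partial_thinking):
    """Return OpenAI-style chat messages that resume after forced </think>."""
    return [
        {"role": "user", "content": user_msg},
        # Pre-filled assistant turn: truncated reasoning, explicitly closed.
        {"role": "assistant",
         "content": f"<think>\n{partial_thinking}\n</think>\n"},
    ]

msgs = build_followup_messages("Just reply with OK",
                               "The user wants a bare OK.")
# Call 2 would POST these messages to /v1/chat/completions (e.g. with
# continue_final_message semantics) to get the answer with no further thinking.
```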
Yes, you can toggle between thinking and non-thinking modes by setting the chat_template_kwargs parameter in your API request. For Qwen3 and similar models, use extra_body={"chat_template_kwargs": {"enable_thinking": False}} to disable thinking, or True to enable it. This works with the OpenAI-compatible API and is supported by vLLM for models with hybrid reasoning capability, as documented in reasoning outputs and Qwen3 reasoning parser.
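For example, a request body toggling thinking off would look like the following (model name taken from this thread; when using the openai client, the chat_template_kwargs entry goes in extra_body instead):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# chat_template_kwargs is a vLLM extension; enable_thinking toggles hybrid
# Qwen3 reasoning (False = answer directly, True = reason first).
payload = {
    "model": "ranjith-model",
    "messages": [{"role": "user", "content": "Just reply with OK"}],
    "max_tokens": 100,
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)  # POST this with curl -d or requests.post
```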
@RunLLM Even after setting thinking: false in the docker compose file, the model is still thinking.
ranjith@rnd-test1:~/ranjith/vllm-test$ curl http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ranjith-model",
    "messages": [
      {
        "role": "user",
        "content": "Just reply with OK"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
{"id":"chatcmpl-a4631d2cd427c3fa","object":"chat.completion","created":1772095549,"model":"ranjith-model","choices":[{"index":0,"message":{"role":"assistant","content":"Hmm, the user just said \"Just reply with OK\". That's a very straightforward request. \n\nOkay, let me think about this. The user is asking me to respond with only \"OK\" - no extra words, no explanations. They're being very specific about the format. \n\nI wonder if they're testing if I can follow simple instructions, or maybe they just want a minimal response for some technical reason. Could be a developer checking API responses, or someone who hates long answers. \n\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":114,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
ranjith@rnd-test1:~/ranjith/vllm-test$ cat qwen3-thinking.yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
Setting --default-chat-template-kwargs '{"thinking": false}' in your docker compose only sets the server-wide default, and any request-level chat_template_kwargs will override it. More importantly, for Qwen3 models the correct key to disable thinking is enable_thinking: false, not thinking: false; using the wrong key will not disable reasoning mode. See reasoning outputs docs.
To fix this, update your docker compose to use --default-chat-template-kwargs '{"enable_thinking": false}' instead. This will properly disable thinking mode for all requests unless overridden at the request level.
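In compose terms, the flag goes on the container's command. A sketch (merge into your existing service definition; the other model arguments shown in your file are elided here):

```yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    # ...container_name, runtime, environment as in your file...
    command: >
      --served-model-name ranjith-model
      --default-chat-template-kwargs '{"enable_thinking": false}'
```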
ranjith@rnd-test1:~/ranjith/vllm-test$ curl http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ranjith-model",
    "messages": [
      {
        "role": "user",
        "content": "Just reply with OK"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
{"id":"chatcmpl-89a2eb6d0cab8490","object":"chat.completion","created":1772097050,"model":"ranjith-model","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, the user just said \"Just reply with OK\". Let me think about this. They want me to respond with just \"OK\" and nothing else. It's a straightforward request.\n\nHmm, I need to make sure I follow their instruction precisely. They don't want any additional text, just \"OK\". So I shouldn't add anything else, even if I want to be helpful. The user might be testing if I can follow simple commands, or maybe they're in a hurry and just","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":114,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
ranjith@rnd-test1:~/ranjith/vllm-test$ cat qwen3-thinking.yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
If you are still seeing “thinking” behavior even after setting --default-chat-template-kwargs '{"enable_thinking": false}', it may be due to the specific Qwen3-Thinking model you are using. Some Qwen3-Thinking models are designed to always generate reasoning traces, regardless of the enable_thinking flag, because their chat template or training enforces thinking mode by default. This is a known limitation discussed in vLLM issues and Qwen3 model docs.
In summary: for Qwen3-Thinking-2507 models, disabling thinking via enable_thinking: false may not work as expected because of the model/template design. There is no server-side workaround for these models; you must use a non-thinking or hybrid variant if you want to reliably disable reasoning.