@RunLLM I want to know how to set the thinking token limit when serving the model using a docker compose file.
Below is the docker compose file that I am using to serve the thinking model:
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
There is currently no built-in vLLM command-line argument or documented API to directly set a “thinking token limit” (e.g., max_thinking_tokens) for reasoning models like Qwen3 in the docker compose or server command. Limiting the number of thinking tokens is a requested feature and can be implemented at the application level using a custom logits processor, but vLLM v1 does not natively support this as a server flag or config parameter yet. See discussion and workarounds in this issue.
If you need to enforce a thinking token limit, you must implement a custom logits processor in your client or application code, not via the docker compose or vLLM server command. Would you like more detail on how to implement this workaround?
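As a rough illustration of that workaround, the sketch below is a framework-agnostic, plain-Python version of such a processor: once a thinking-token budget is spent, it masks every logit except the end-of-thinking token so the model is forced to close its reasoning. The token id and budget are made-up values, and a real integration would have to plug an equivalent callable into the actual inference pipeline.

```python
# Hypothetical thinking-token budget enforcer, written in the classic
# (generated_token_ids, logits) -> logits shape that logits processors use.
# THINK_END_ID and MAX_THINKING_TOKENS are illustrative placeholders.

THINK_END_ID = 7          # hypothetical vocab id of the "</think>" token
MAX_THINKING_TOKENS = 5   # hypothetical budget for the reasoning section

def limit_thinking(generated_ids, logits):
    """Force the end-of-thinking token once the budget is exhausted."""
    budget_spent = len(generated_ids) >= MAX_THINKING_TOKENS
    still_thinking = THINK_END_ID not in generated_ids
    if budget_spent and still_thinking:
        # Mask everything except </think>, so it is the only possible choice.
        return [0.0 if i == THINK_END_ID else float("-inf")
                for i in range(len(logits))]
    return logits  # within budget (or already done thinking): no change
```

Within the budget the logits pass through untouched; at or past the budget, only the end-of-thinking token remains selectable.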
No, there is no built-in way to control the thinking token limit from the vLLM server side or via OpenAI-compatible API parameters when using external agent frameworks like Google ADK. The only available methods are custom application-level logic or modifying the model prompt to encourage shorter reasoning, but these are soft controls, not hard token limits. See this discussion.
Would you like more detail on possible prompt engineering or client-side workarounds?
A client-side workaround involves monitoring the model’s output for the start of the “thinking” section (e.g., <think>) and stopping generation or sending a stop sequence once your desired token limit is reached. This can be done by making two API calls: first, generate up to the thinking token limit, then prompt the model to end the reasoning and produce the final answer. This approach is described in this issue.
Alternatively, you can implement a custom logits processor in your client code to force the end-of-thinking token after a set number of tokens, but this requires direct integration with the model inference pipeline and is not supported via the OpenAI API interface in vLLM v1. Would you like a step-by-step example of either approach?
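To make the two-call approach above concrete, here is a small pure-Python helper (the names are my own, not a vLLM API): call 1 is an ordinary chat-completions request with max_tokens set to your thinking budget, which yields a truncated reasoning trace; this helper then builds the messages for call 2, closing the think block ourselves so the model continues straight into the final answer. The think-tag convention follows Qwen3; the actual HTTP calls are only indicated in comments since they need a running server.

```python
# Sketch of call 2 in the two-call workaround: feed back the truncated
# reasoning as a pre-filled assistant turn with an explicit </think>.

def build_followup_messages(user_msg, partial_thinking):
    """Return OpenAI-style chat messages that resume after forced </think>."""
    return [
        {"role": "user", "content": user_msg},
        # Pre-filled assistant turn: truncated reasoning, explicitly closed.
        {"role": "assistant",
         "content": f"<think>\n{partial_thinking}\n</think>\n"},
    ]

msgs = build_followup_messages("Just reply with OK",
                               "The user wants a bare OK.")
# Call 2 would POST these messages to /v1/chat/completions (e.g. with
# continue_final_message semantics) to get the answer with no further thinking.
```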
Yes, you can toggle between thinking and non-thinking modes by setting the chat_template_kwargs parameter in your API request. For Qwen3 and similar models, use extra_body={"chat_template_kwargs": {"enable_thinking": False}} to disable thinking, or True to enable it. This works with the OpenAI-compatible API and is supported by vLLM for models with hybrid reasoning capability, as documented in reasoning outputs and Qwen3 reasoning parser.
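For example, a request body toggling thinking off would look like the following (model name taken from this thread; when using the openai client, the chat_template_kwargs entry goes in extra_body instead):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# chat_template_kwargs is a vLLM extension; enable_thinking toggles hybrid
# Qwen3 reasoning (False = answer directly, True = reason first).
payload = {
    "model": "ranjith-model",
    "messages": [{"role": "user", "content": "Just reply with OK"}],
    "max_tokens": 100,
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)  # POST this with curl -d or requests.post
```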
@RunLLM Even after setting thinking: false in the docker compose file, the model is still thinking.
ranjith@rnd-test1:~/ranjith/vllm-test$ curl http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ranjith-model",
    "messages": [
      {
        "role": "user",
        "content": "Just reply with OK"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
{"id":"chatcmpl-a4631d2cd427c3fa","object":"chat.completion","created":1772095549,"model":"ranjith-model","choices":[{"index":0,"message":{"role":"assistant","content":"Hmm, the user just said \"Just reply with OK\". That's a very straightforward request. \n\nOkay, let me think about this. The user is asking me to respond with only \"OK\" - no extra words, no explanations. They're being very specific about the format. \n\nI wonder if they're testing if I can follow simple instructions, or maybe they just want a minimal response for some technical reason. Could be a developer checking API responses, or someone who hates long answers. \n\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":114,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
ranjith@rnd-test1:~/ranjith/vllm-test$ cat qwen3-thinking.yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
Setting --default-chat-template-kwargs '{"thinking": false}' in your docker compose only sets the server-wide default, and any request-level chat_template_kwargs will override it. More importantly, for Qwen3 models the correct key to disable thinking is enable_thinking: false, not thinking: false; using the wrong key will not disable reasoning mode. See reasoning outputs docs.
To fix this, update your docker compose to use --default-chat-template-kwargs '{"enable_thinking": false}' instead. This will properly disable thinking mode for all requests unless overridden at the request level.
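In compose terms, the flag goes on the container's command. A sketch (merge into your existing service definition; the other model arguments shown in your file are elided here):

```yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    # ...container_name, runtime, environment as in your file...
    command: >
      --served-model-name ranjith-model
      --default-chat-template-kwargs '{"enable_thinking": false}'
```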
ranjith@rnd-test1:~/ranjith/vllm-test$ curl http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ranjith-model",
    "messages": [
      {
        "role": "user",
        "content": "Just reply with OK"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
{"id":"chatcmpl-89a2eb6d0cab8490","object":"chat.completion","created":1772097050,"model":"ranjith-model","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, the user just said \"Just reply with OK\". Let me think about this. They want me to respond with just \"OK\" and nothing else. It's a straightforward request.\n\nHmm, I need to make sure I follow their instruction precisely. They don't want any additional text, just \"OK\". So I shouldn't add anything else, even if I want to be helpful. The user might be testing if I can follow simple commands, or maybe they're in a hurry and just","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":114,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
ranjith@rnd-test1:~/ranjith/vllm-test$ cat qwen3-thinking.yaml
services:
  qwen3-thinking:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-thinking
    runtime: nvidia
    environment:
If you are still seeing “thinking” behavior even after setting --default-chat-template-kwargs '{"enable_thinking": false}', it may be due to the specific Qwen3-Thinking model you are using. Some Qwen3-Thinking models are designed to always generate reasoning traces, regardless of the enable_thinking flag, because their chat template or training enforces thinking mode by default. This is a known limitation discussed in vLLM issues and Qwen3 model docs.
In summary: for Qwen3-Thinking-2507 models, disabling thinking via enable_thinking: false may not work as expected because of the model/template design. There is no server-side workaround for these models; you must use a non-thinking or hybrid variant if you want to reliably disable reasoning.