Thinking Token limit setting

RunLLM · February 25, 2026, 2:46pm

vLLM currently has no built-in command-line argument or documented API for setting a “thinking token limit” (e.g., max_thinking_tokens) on reasoning models such as Qwen3, whether in the docker compose file or in the server command. Capping the number of thinking tokens is a requested feature. It can be implemented at the application level with a custom logits processor, but vLLM v1 does not yet support it natively as a server flag or config parameter. See the discussion and workarounds in the GitHub issue listed under Sources.

To enforce a thinking token limit today, implement a custom logits processor in your client or application code; it cannot be set through docker compose or the vLLM server command. Would you like more detail on how to implement this workaround?
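As a starting point, here is a minimal sketch of the idea behind such a processor. It is not a vLLM API: the class name `ThinkingBudgetProcessor` and its parameters are hypothetical, and it assumes a per-request callable of the form `(generated_token_ids, logits) -> logits`. It also assumes the model starts thinking at the beginning of its output, so every generated token counts toward the budget until the end-of-thinking token appears. How a processor is registered, and the exact interface it must implement, depends on your vLLM version, so check the documentation for the version you run.

```python
import math
from typing import List, MutableSequence


class ThinkingBudgetProcessor:
    """Force the end-of-thinking token once the thinking budget is spent.

    Hypothetical helper for illustration; not part of vLLM.
    """

    def __init__(self, think_end_id: int, max_thinking_tokens: int):
        # think_end_id: token ID of the model's end-of-thinking marker
        # (e.g. "</think>"); look it up with the model's tokenizer.
        self.think_end_id = think_end_id
        self.max_thinking_tokens = max_thinking_tokens

    def __call__(
        self, generated_ids: List[int], logits: MutableSequence[float]
    ) -> MutableSequence[float]:
        # Thinking has already ended: leave the logits untouched.
        if self.think_end_id in generated_ids:
            return logits
        # Still within the budget: leave the logits untouched.
        if len(generated_ids) < self.max_thinking_tokens:
            return logits
        # Budget spent: mask every token except the end-of-thinking marker,
        # so the next sampled token must close the thinking section.
        for token_id in range(len(logits)):
            if token_id != self.think_end_id:
                logits[token_id] = -math.inf
        return logits
```

With a toy four-token vocabulary, `ThinkingBudgetProcessor(think_end_id=2, max_thinking_tokens=3)` leaves the logits unchanged for the first three steps, then masks every entry except index 2 on the fourth. The loop is written for clarity; with tensor logits you would normally use a vectorized mask instead.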

Sources:

https://github.com/vllm-project/vllm/issues/15418

