Thinking Token limit setting

RunLLM · February 25, 2026, 2:49pm

A client-side workaround is to watch the model’s output for the start of the “thinking” section (e.g., <think>) and stop generation once the thinking tokens reach your limit. In practice this takes two API calls: the first generates up to the thinking token limit (for example, by capping max_tokens), and the second appends the partial reasoning plus the end-of-thinking marker (e.g., </think>) to the prompt so that the model ends its reasoning and produces the final answer. This approach is described in the GitHub issue listed under Sources.
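The two-call flow could be sketched roughly as follows. This is a minimal illustration, not the method from the linked issue: `complete` is a hypothetical callable you supply (it takes a prompt, a token cap, and stop strings, and returns the text plus a finish reason of `"stop"` or `"length"`), the `<think>`/`</think>` markers and the wrap-up sentence are assumptions that depend on the model and its chat template, and the prompt is assumed to be already formatted for a raw completions-style endpoint.

```python
from typing import Callable, List, Tuple

THINK_OPEN = "<think>"    # assumed marker; check your model's chat template
THINK_CLOSE = "</think>"  # assumed marker; check your model's chat template

# Hypothetical signature: (prompt, max_tokens, stop) -> (text, finish_reason)
CompleteFn = Callable[[str, int, List[str]], Tuple[str, str]]


def generate_with_thinking_budget(
    complete: CompleteFn,
    prompt: str,
    thinking_budget: int,
    answer_max_tokens: int = 512,
    wrap_up: str = "\nI should stop reasoning and give the final answer now.\n",
) -> Tuple[str, str]:
    """Return (thinking, answer), capping the thinking part at thinking_budget tokens."""
    prefix = prompt + THINK_OPEN

    # Call 1: let the model reason, but stop at the budget or when it
    # closes the thinking section on its own.
    thinking, finish_reason = complete(prefix, thinking_budget, [THINK_CLOSE])

    # If the budget cut the reasoning off, nudge the model toward an answer.
    if finish_reason == "length":
        thinking += wrap_up

    # Call 2: close the thinking section ourselves and ask for the answer.
    second_prompt = prefix + thinking + THINK_CLOSE + "\n\n"
    answer, _ = complete(second_prompt, answer_max_tokens, [])
    return thinking, answer
```

In a real setup, `complete` would typically wrap a completions request to the vLLM server, passing `max_tokens` and `stop` through and reading the text and finish reason from the first choice. If your chat template already opens the thinking section, drop the extra `THINK_OPEN`.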

Alternatively, you can implement a custom logits processor that forces the end-of-thinking token after a set number of tokens. This requires direct integration with the model inference pipeline and is not supported through the OpenAI-compatible API in vLLM v1. Would you like a step-by-step example of either approach?
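For illustration only, the logits-processor idea might look like the framework-agnostic sketch below. It uses plain Python lists rather than tensors, and the `(generated_token_ids, logits)` callable shape is an assumption about how your inference pipeline invokes processors, not a vLLM v1 API. `end_think_token_id` is a hypothetical value you would look up from the model's tokenizer, and the sketch assumes generation starts inside the thinking section.

```python
import math
from typing import Callable, List

# Assumed processor shape: (generated_token_ids, logits) -> logits
LogitsProcessor = Callable[[List[int], List[float]], List[float]]


def make_thinking_budget_processor(
    end_think_token_id: int, budget: int
) -> LogitsProcessor:
    """Build a processor that forces the end-of-thinking token once the budget is spent."""

    def processor(generated_token_ids: List[int], logits: List[float]) -> List[float]:
        # Thinking already closed: leave the distribution untouched.
        if end_think_token_id in generated_token_ids:
            return logits
        # Still within budget: leave the distribution untouched.
        if len(generated_token_ids) < budget:
            return logits
        # Budget spent: mask every token except the end-of-thinking token.
        forced = [-math.inf] * len(logits)
        forced[end_think_token_id] = 0.0
        return forced

    return processor
```

Masking to a single token guarantees the marker is emitted on the next step regardless of sampling settings. A softer variant could add a growing bias to the marker's logit instead of a hard mask.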

Sources:

https://github.com/vllm-project/vllm/issues/15418

