Hi all,
We’re running Qwen/Qwen3.5-9B with thinking mode enabled on vLLM v0.18.0, using the Python API directly (not the OpenAI-compatible server). Our use case is structured JSON extraction from document images — some pages are data-dense and cause the model to enter long reasoning loops, exhausting all output tokens on <think> content without ever producing </think> or the actual JSON response.
We wanted to use thinking_token_budget to cap the reasoning tokens and force the model to transition to output. However, this parameter is not a first-class field in vLLM 0.18.0’s SamplingParams:
```python
from vllm import SamplingParams
import inspect

params = [p for p in inspect.signature(SamplingParams).parameters
          if "think" in p.lower()]
# Result: [] — no thinking-related parameters
```
We found that SamplingParams accepts an extra_args dict, so we passed it there:
```python
sp = SamplingParams(
    max_tokens=16384,
    temperature=0.6,
    extra_args={"thinking_token_budget": 4096},
)
```
This is accepted without error and stored in sp.extra_args. We tested with a trivial input (1x1 pixel image) and it appeared to work — the model finished thinking in ~50 tokens and produced valid output. However, when we ran it across 23 real document pages, the budget was not enforced on dense inputs.
Evidence:
We logged thinking vs response content per page. On 21 of 23 pages, the model wrapped its reasoning in <think>...</think> tags and produced output normally. But on the two densest pages (21 and 23):
| Page | Thinking (chars) | Response (chars) | Has `<think>` tags? | Valid JSON? |
|---|---|---|---|---|
| 21 | 0 | 53,685 | No | No |
| 23 | 0 | 36,849 | No | No |
These pages show 0 chars of thinking and massive response bodies — meaning the model never emitted `<think>` tags at all. Its reasoning was dumped as raw unstructured text, and it never transitioned to producing JSON. This is exactly the failure mode we see without the budget parameter.
Pages with moderate complexity (e.g., pages 13, 14 with ~20K chars of thinking) succeeded — but we believe those completed thinking naturally within a reasonable limit, not because the budget was enforced.
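For reference, the per-page thinking/response split above was computed roughly like this (a sketch; it assumes the model wraps its reasoning in Qwen-style `<think>...</think>` tags):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the response body."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        # No tags at all -- the failure mode we see on the dense pages:
        # everything counts as "response", nothing as "thinking".
        return "", text
    return m.group(1), text[m.end():]
```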
Questions:
- Is `thinking_token_budget` actually read from `extra_args` by the engine in v0.18.0, or is it only effective when it's a native `SamplingParams` field?
- We saw PR #37112, which adds `reasoning_budget` — is this the correct path forward? Which vLLM version will include it?
- Is there any other way to enforce a thinking token cap in the current release (0.18.0) without modifying vLLM internals?
Any guidance would be appreciated. Happy to share logs or the full per-page breakdown if helpful.
Thanks!