thinking_token_budget silently ignored when passed via extra_args in vLLM 0.18.0

Hi all,

We’re running Qwen/Qwen3.5-9B with thinking mode enabled on vLLM v0.18.0, using the Python API directly (not the OpenAI-compatible server). Our use case is structured JSON extraction from document images — some pages are data-dense and cause the model to enter long reasoning loops, exhausting all output tokens on <think> content without ever producing </think> or the actual JSON response.

We wanted to use thinking_token_budget to cap the reasoning tokens and force the model to transition to output. However, this parameter is not a first-class field in vLLM 0.18.0’s SamplingParams:

from vllm import SamplingParams
import inspect
params = [p for p in inspect.signature(SamplingParams).parameters if 'think' in p.lower()]
# Result: [] — no thinking-related parameters

We found that SamplingParams accepts an extra_args dict, so we passed it there:

sp = SamplingParams(
    max_tokens=16384,
    temperature=0.6,
    extra_args={"thinking_token_budget": 4096}
)

This is accepted without error and stored in sp.extra_args. We tested with a trivial input (1x1 pixel image) and it appeared to work — the model finished thinking in ~50 tokens and produced valid output. However, when we ran it across 23 real document pages, the budget was not enforced on dense inputs.

Evidence:

We logged thinking vs response content per page. On 21 of 23 pages, the model wrapped its reasoning in <think>...</think> tags and produced output normally. But on the two densest pages (21 and 23):

| Page | Thinking (chars) | Response (chars) | Has `<think>` tags | Valid JSON? |
|------|------------------|------------------|--------------------|-------------|
| 21   | 0                | 53,685           | No                 | No          |
| 23   | 0                | 36,849           | No                 | No          |

These pages show 0 chars of thinking and massive response bodies — meaning the model never emitted <think> tags at all. The “thinking” was dumped as raw unstructured text, and the model never transitioned to producing JSON. This is the exact same failure mode we see without the budget parameter.
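
For reference, the per-page numbers above come from a tag split along these lines (a simplified sketch of our logging, not the exact code): a page with no closed `<think>...</think>` block is counted as 0 thinking chars, with the entire output treated as response.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Split model output into (thinking, response).

    A page with no closed <think>...</think> block is logged as 0 thinking
    chars and the full output is counted as response -- exactly the failure
    mode on pages 21 and 23.
    """
    m = THINK_RE.search(text)
    if m is None:
        return "", text
    return m.group(1), text[m.end():]
```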

Pages with moderate complexity (e.g., pages 13, 14 with ~20K chars of thinking) succeeded — but we believe those completed thinking naturally within a reasonable limit, not because the budget was enforced.
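
A rough chars-to-tokens estimate is consistent with this: under the common ~4 chars/token heuristic (an assumption — actual ratios depend on the tokenizer), ~20K chars of thinking is on the order of 5K tokens, which already exceeds our 4,096-token budget. If the budget were enforced, those pages should have been cut off earlier.

```python
def approx_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    # Rough heuristic only; real token counts depend on the tokenizer.
    return round(chars / chars_per_token)

# ~20K chars of thinking on pages 13/14 is roughly 5K tokens,
# which is above the 4096-token budget we requested.
print(approx_tokens(20_000))          # -> 5000
print(approx_tokens(20_000) > 4096)   # -> True
```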

Questions:

  1. Is thinking_token_budget actually read from extra_args by the engine in v0.18.0, or is it only effective when it’s a native SamplingParams field?
  2. We saw PR #37112 which adds reasoning_budget — is this the correct path forward? Which vLLM version will include it?
  3. Is there any other way to enforce a thinking token cap in the current release (0.18.0) without modifying vLLM internals?
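
On question 3, one idea we have been sketching (not yet verified against vLLM 0.18.0's actual logits-processor interface — the callback signature below is an assumption, and plain lists stand in for logit tensors) is a custom logits processor that masks everything except the `</think>` token once the budget is spent:

```python
NEG_INF = float("-inf")

def make_think_budget_processor(end_think_token_id: int, budget: int):
    """Return a processor that forces `end_think_token_id` once `budget`
    tokens have been generated without closing the think block.

    Hypothetical sketch: the real vLLM logits-processor signature and
    tensor types would need to be confirmed before using this.
    """
    def processor(generated_token_ids, logits):
        over_budget = len(generated_token_ids) >= budget
        already_closed = end_think_token_id in generated_token_ids
        if over_budget and not already_closed:
            # Mask everything except the </think> token.
            return [0.0 if i == end_think_token_id else NEG_INF
                    for i in range(len(logits))]
        return logits
    return processor
```

Whether `SamplingParams(logits_processors=[...])` passes the callback only the generated tokens or prompt plus generated tokens differs across vLLM versions, so this would need testing before we could rely on it.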

Any guidance would be appreciated. Happy to share logs or the full per-page breakdown if helpful.

Thanks!

thinking_token_budget is not a native SamplingParams field in vLLM v0.18.0; values passed via extra_args are stored but never read by the engine, so the budget is not enforced. The ability to cap reasoning tokens via thinking_token_budget (or reasoning_budget) was added in later PRs (e.g., #20859, #37112) and is only available in vLLM mainline after those merges, not in v0.18.0.

To get enforced budgets, upgrade to a vLLM version that includes those PRs, where thinking_token_budget is a documented, enforced parameter in SamplingParams and the engine; see the ThinkingTokenBudgetLogitsProcessor and SamplingParams docs for usage. In v0.18.0 there is no supported way to enforce a thinking-token cap without modifying vLLM internals.
