Disabling reasoning of Qwen3-VL-8B-Thinking per request

Hello,

I started my vLLM instance with the Qwen3-VL-8B-Thinking model using this command:

```
sudo docker run -d --runtime=nvidia --gpus "device=0" \
  -p 8002:8000 --name q8t \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.11.0 \
  --model Qwen/Qwen3-VL-8B-Thinking \
  --served-model-name generator \
  --tensor-parallel-size 1 \
  --max-model-len 8000 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --dtype bfloat16 \
  --chat-template-content-format openai
```

I would like to enable or disable thinking per request, but I do not know how to do that: whatever I try (e.g. appending /no_think to the prompt), the model always reasons about the prompt.

Is it even possible to turn the reasoning of Qwen3-VL-Thinking models on or off per request?

Yes, in principle you can enable or disable "thinking" (reasoning) per request for Qwen3 models in vLLM by passing enable_thinking in the request body. For the Qwen3 series, set extra_body={"chat_template_kwargs": {"enable_thinking": False}} to disable reasoning for that specific request, or True to enable it. However, for Qwen3-VL-Thinking models the default chat template always includes the <think> tag, so disabling thinking via this parameter may not work as expected, and the model may still output reasoning content regardless of the flag. This is a known limitation of how the chat template is constructed for these models; using the DeepSeek-R1 reasoning parser is sometimes recommended as a workaround, but it does not fully resolve the issue for all Qwen3-VL-Thinking variants.
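As a minimal sketch of the per-request approach (assuming the server started with the command above, listening on port 8002 with served model name generator), you can forward chat_template_kwargs through the OpenAI client's extra_body. Note that on Qwen3-VL-Thinking this flag may have no visible effect, for the template reasons described above:

```python
# Minimal sketch: attempt to disable thinking per request via chat_template_kwargs.
# Assumes the vLLM server launched above (port 8002, served model name "generator").
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="generator",
    messages=[{"role": "user", "content": "Describe the capital of France in one sentence."}],
    # vLLM passes chat_template_kwargs into the chat template; for Qwen3 models this
    # maps to enable_thinking. The Qwen3-VL-Thinking template may ignore it.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)
```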

In summary: for most Qwen3 models you can control reasoning per request with enable_thinking, but for Qwen3-VL-Thinking the model's template enforces reasoning mode, so disabling it per request is not reliably supported in vLLM as of now. Would you like more detail or example code?
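If the flag has no effect, a pragmatic fallback is to strip the reasoning block from the response on the client side. This is a minimal sketch under the assumption that the reasoning arrives inline in message.content wrapped in <think>...</think> (the exact format depends on the chat template and on whether a reasoning parser such as deepseek_r1 is configured on the server, in which case reasoning is returned separately as reasoning_content instead):

```python
# Fallback sketch: strip the <think>...</think> reasoning block client-side.
# Assumes no reasoning parser is configured, so reasoning arrives inline in the content.
import re

def strip_thinking(text: str) -> str:
    # Remove a full <think>...</think> section; some templates inject the opening
    # <think> tag themselves, so the output may contain only a bare closing tag.
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[1]
    return cleaned.strip()

example = "<think>The user wants one sentence about Paris...</think>Paris is the capital of France."
print(strip_thinking(example))  # -> "Paris is the capital of France."
```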

Hint: Mention @RunLLM in the post for followups.