Disabling reasoning of Qwen3-VL-8B-Thinking per request

Hello,

I started my vLLM instance with the Qwen3-VL-8B-Thinking model using this command:

```
sudo docker run -d --runtime=nvidia --gpus "device=0" \
  -p 8002:8000 --name q8t \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.11.0 \
  --model Qwen/Qwen3-VL-8B-Thinking \
  --served-model-name generator \
  --tensor-parallel-size 1 \
  --max-model-len 8000 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --dtype bfloat16 \
  --chat-template-content-format openai
```

I would like to enable or disable thinking per request, but I do not know how to do that: whatever I try (e.g. appending /no_think to the prompt), the model always reasons about the prompt.

Is it even possible to turn the reasoning of Qwen3-VL-Thinking models on or off per request?

Yes, in principle you can enable or disable "thinking" (reasoning) per request for Qwen3 models in vLLM by passing enable_thinking in the request body. For the Qwen3 series, set extra_body={"chat_template_kwargs": {"enable_thinking": False}} to disable reasoning for that specific request, or True to enable it. However, for Qwen3-VL-Thinking models the default chat template always includes the <think> tag, so disabling thinking via this parameter may not work as expected, and the model may still output reasoning content regardless of the flag. This is a known limitation of how the chat template is constructed for these models; using the DeepSeek-R1 reasoning parser is sometimes recommended as a workaround, but it does not fully resolve the issue for all Qwen3-VL-Thinking variants.
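As a minimal sketch of the per-request approach (assuming the server started with the command above, listening on port 8002 with served model name generator), you can forward chat_template_kwargs through the OpenAI client's extra_body. Note that on Qwen3-VL-Thinking this flag may have no visible effect, for the template reasons described above:

```python
# Minimal sketch: attempt to disable thinking per request via chat_template_kwargs.
# Assumes the vLLM server launched above (port 8002, served model name "generator").
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="generator",
    messages=[{"role": "user", "content": "Describe the capital of France in one sentence."}],
    # vLLM passes chat_template_kwargs into the chat template; for Qwen3 models this
    # maps to enable_thinking. The Qwen3-VL-Thinking template may ignore it.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)
```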

In summary: for most Qwen3 models you can control reasoning per request with enable_thinking, but for Qwen3-VL-Thinking the model's template enforces reasoning mode, so disabling it per request is not reliably supported in vLLM as of now. Would you like more detail or example code?
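If the flag has no effect, a pragmatic fallback is to strip the reasoning block from the response on the client side. This is a minimal sketch under the assumption that the reasoning arrives inline in message.content wrapped in <think>...</think> (the exact format depends on the chat template and on whether a reasoning parser such as deepseek_r1 is configured on the server, in which case reasoning is returned separately as reasoning_content instead):

```python
# Fallback sketch: strip the <think>...</think> reasoning block client-side.
# Assumes no reasoning parser is configured, so reasoning arrives inline in the content.
import re

def strip_thinking(text: str) -> str:
    # Remove a full <think>...</think> section; some templates inject the opening
    # <think> tag themselves, so the output may contain only a bare closing tag.
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[1]
    return cleaned.strip()

example = "<think>The user wants one sentence about Paris...</think>Paris is the capital of France."
print(strip_thinking(example))  # -> "Paris is the capital of France."
```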

Hint: Mention @RunLLM in the post for followups.