Deployment example for a Qwen3 model with hybrid thinking

Could you please show me a deployment example for a Qwen3 model with hybrid thinking, for example Qwen3-30B-A3B, and an example of disabling thinking?

To deploy Qwen3-30B-A3B with hybrid thinking (reasoning enabled), use the following command, specifying the reasoning parser as qwen3:

vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser qwen3

To disable thinking (reasoning) for a specific request, pass chat_template_kwargs={"enable_thinking": False} in your API call or Python client:

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=messages,
    # The OpenAI SDK has no chat_template_kwargs parameter, so pass it
    # through extra_body; vLLM forwards it to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

According to the docs, the Qwen3 reasoning feature is enabled by default and can be disabled per request as shown above. For more details, see the official reasoning outputs documentation and Qwen3 reasoning parser API.

Would you like more detail on hybrid thinking or deployment options?


--enable-reasoning is outdated.

Correct: the --enable-reasoning flag was deprecated and has been removed as of vLLM v0.10.0. Use only --reasoning-parser (e.g., --reasoning-parser qwen3) to enable reasoning mode for supported models. When --reasoning-parser is specified, reasoning is enabled automatically, so --enable-reasoning is no longer needed. See the engine args documentation and the deprecation notice.
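Concretely, the earlier serve command can be rewritten without the removed flag. This is a sketch assuming the same model and parser names as above:

```shell
vllm serve Qwen/Qwen3-30B-A3B --reasoning-parser qwen3
```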

Would you like more detail on migration or usage?


Do I still need to set chat_template_kwargs to disable thinking?

Yes. To disable thinking (reasoning) for Qwen3 and similar models, you still need to set chat_template_kwargs={"enable_thinking": False} in your API request. This is the recommended way to turn off reasoning on a per-request basis, even on the latest vLLM with a reasoning parser. See the reasoning outputs documentation and the Qwen3 usage guide.
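For clients that send raw HTTP requests instead of going through the OpenAI SDK, chat_template_kwargs sits at the top level of the JSON body. A minimal sketch of the request payload, assuming the same model and prompt as the earlier example:

```python
import json

# Request body for POST /v1/chat/completions on a vLLM server.
# "chat_template_kwargs" is a vLLM extension to the OpenAI schema,
# so it goes at the top level of the body, not inside a standard field.
payload = {
    "model": "Qwen/Qwen3-30B-A3B",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

With curl, you would POST this body to http://localhost:8000/v1/chat/completions with Content-Type: application/json.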

Would you like more detail or code examples?


I set it, but I still got the reasoning process?

If you set chat_template_kwargs={"enable_thinking": False} but still receive reasoning content, check that you are using the correct reasoning parser (qwen3 for Qwen3 models) and a recent vLLM version. Older versions, or an incorrect parser setting, can ignore the disable flag, so reasoning content appears regardless. This was fixed in recent vLLM releases: updating to v0.9.1 or later and serving with --reasoning-parser qwen3 should resolve it (reasoning outputs doc, bugfix PR).
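As a quick sanity check, you can compare your installed vLLM version against the v0.9.1 threshold mentioned above. A minimal sketch using only the standard library; the helper names are made up for illustration, and the v0.9.1 cutoff is taken from the discussion above:

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(v: str) -> tuple:
    """Turn a version string like '0.9.1' into a comparable tuple.

    Takes only the leading digits of each of the first three
    components, so suffixes like '0.9.1rc1' still parse.
    """
    parts = []
    for p in v.split(".")[:3]:
        m = re.match(r"\d+", p)
        parts.append(int(m.group()) if m else 0)
    return tuple(parts)


def has_enable_thinking_fix(installed: str) -> bool:
    # Assumption from the discussion above: the disable-thinking
    # fix landed in vLLM v0.9.1.
    return parse_version(installed) >= (0, 9, 1)


try:
    print(has_enable_thinking_fix(version("vllm")))
except PackageNotFoundError:
    print("vllm is not installed")
```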

Would you like troubleshooting steps or a version check guide?
