@RunLLM , I am getting below warning while serving Qwen/Qwen3.6-35B-A3B-FP8 model using v0.19.0 of vLLM. docker compose used: services: qwen3.6: image: vllm/vllm-openai:v0.19.0 container_name: qwen3.6 runtime: nvidia environment: NVIDIA_VISIBLE_DEVICES=all volumes: /home/ranjith/.cache/hug…

The warning (EngineCore pid=77) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (32)... is expected and not an error. It occurs when the sequence length is less than …

[image] RunLLM: Would you like more detail on why this warning appears or how to suppress it? @RunLLM , Would you like more detail on why this warning appears or how to suppress it? yes

Certainly! Here’s a step-by-step explanation: Why does this warning appear? The warning Input tensor shape suggests potential format mismatch: seq_len (X) < num_heads (Y) is triggered in vLLM’s Flash Linear Attention (FLA) ops when the input tensor’s sequence length (number of tokens) is less than…

[image] Ranjith: Qwen/Qwen3.6-35B-A3B-FP8 @RunLLM , Is there any option to make model to stop thinking in model Qwen/Qwen3.6-35B-A3B-FP8?

Yes, you can disable “thinking” mode in Qwen/Qwen3.6-35B-A3B-FP8 by setting "enable_thinking": false in the chat_template_kwargs field of your API request payload. This works for Qwen3.5/3.6 models in vLLM ( source ). Would you like a code example or more detail? Sources: vllm-project/vllm#35574 H…

Setting "enable_thinking": false in chat_template_kwargs for Qwen3.5/3.6 models disables the generation of reasoning/thinking content—the model will not produce or output any thinking tokens, and only the final answer is generated ( docs.vllm.ai reasoning outputs , vllm-project/vllm#35574 ). If you sti…

Warning while serving Qwen/Qwen3.6-35B-A3B-FP8

General

Ranjith April 21, 2026, 7:11am 7

@RunLLM But making “enable_thinking”: false is not avoiding thinking in the model it is just not streaming the thinking tokens as output

Topic		Replies	Views
Disabling reasoning of Qwen3-VL-8B-Thinking per request Model Support	1	3713	October 29, 2025
Trying to run Qwen3.5-397B-A17B-GPTQ-Int4 Model Support	10	568	March 13, 2026
Qwen3.5-27b-fp8没有think General	3	367	March 30, 2026
RTX PRO6000运行Qwen3-32B-FP8量化版本出错 General	1	510	July 4, 2025
RuntimeError: Int8 not supported on SM120. Use FP8 quantization instead, or run on older arch (SM < 100) NVIDIA GPU Support	1	230	November 19, 2025

Warning while serving Qwen/Qwen3.6-35B-A3B-FP8

Related topics