Warning while serving Qwen/Qwen3.6-35B-A3B-FP8

@RunLLM , I am getting the warning below while serving the Qwen/Qwen3.6-35B-A3B-FP8 model with vLLM v0.19.0.

docker compose used:

    services:
      qwen3.6:
        image: vllm/vllm-openai:v0.19.0
        container_name: qwen3.6
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
        volumes:
          - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
          - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
        ports:
          - "9122:8000"
        ipc: host
        command: >
          /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2
          --served-model-name ranjith-model
          --gpu-memory-utilization 0.9
          --max_model_len 32768
          --max_num_batched_tokens 4096
          --enable-auto-tool-choice
          --tool-call-parser hermes
          --enforce-eager

Logs:
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] [vLLM ASCII banner] version 0.19.0
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] model /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', 'max_model_len': 32768, 'enforce_eager': True, 'served_model_name': ['ranjith-model'], 'max_num_batched_tokens': 4096}
(APIServer pid=1) INFO 04-20 06:08:21 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 04-20 06:08:21 [model.py:1678] Using max model len 32768
(APIServer pid=1) INFO 04-20 06:08:21 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 04-20 06:08:21 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-20 06:08:21 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 04-20 06:08:21 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-20 06:08:21 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-20 06:08:21 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-20 06:08:21 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 04-20 06:08:21 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=77) INFO 04-20 06:08:34 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ranjith-model, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': , 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': , 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': , 'compile_ranges_endpoints': [4096], 'inductor_compile_config':
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': , 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': }
(EngineCore pid=77) INFO 04-20 06:08:35 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.2:45789 backend=nccl
(EngineCore pid=77) INFO 04-20 06:08:35 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=77) INFO 04-20 06:08:39 [gpu_model_runner.py:4735] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2…
(EngineCore pid=77) INFO 04-20 06:08:40 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=77) INFO 04-20 06:08:40 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=77) INFO 04-20 06:08:40 [init.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=77) INFO 04-20 06:08:40 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(EngineCore pid=77) INFO 04-20 06:08:40 [fp8.py:396] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore pid=77) INFO 04-20 06:08:41 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=77) INFO 04-20 06:08:41 [flash_attn.py:596] Using FlashAttention version 2
Loading safetensors checkpoint shards: 0% Completed | 0/42 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/42 [00:02<01:23, 2.04s/it]
Loading safetensors checkpoint shards: 5% Completed | 2/42 [00:04<01:24, 2.11s/it]
Loading safetensors checkpoint shards: 7% Completed | 3/42 [00:06<01:22, 2.12s/it]
Loading safetensors checkpoint shards: 10% Completed | 4/42 [00:08<01:20, 2.12s/it]
Loading safetensors checkpoint shards: 12% Completed | 5/42 [00:10<01:18, 2.12s/it]
Loading safetensors checkpoint shards: 14% Completed | 6/42 [00:12<01:15, 2.09s/it]
Loading safetensors checkpoint shards: 17% Completed | 7/42 [00:14<01:14, 2.13s/it]
Loading safetensors checkpoint shards: 19% Completed | 8/42 [00:17<01:14, 2.20s/it]
Loading safetensors checkpoint shards: 21% Completed | 9/42 [00:19<01:14, 2.26s/it]
Loading safetensors checkpoint shards: 24% Completed | 10/42 [00:21<01:10, 2.19s/it]
Loading safetensors checkpoint shards: 26% Completed | 11/42 [00:23<01:06, 2.14s/it]
Loading safetensors checkpoint shards: 29% Completed | 12/42 [00:25<01:03, 2.11s/it]
Loading safetensors checkpoint shards: 31% Completed | 13/42 [00:27<00:59, 2.06s/it]
Loading safetensors checkpoint shards: 33% Completed | 14/42 [00:29<00:58, 2.08s/it]
Loading safetensors checkpoint shards: 36% Completed | 15/42 [00:31<00:55, 2.05s/it]
Loading safetensors checkpoint shards: 38% Completed | 16/42 [00:33<00:51, 1.99s/it]
Loading safetensors checkpoint shards: 40% Completed | 17/42 [00:35<00:49, 1.98s/it]
Loading safetensors checkpoint shards: 43% Completed | 18/42 [00:37<00:47, 1.96s/it]
Loading safetensors checkpoint shards: 45% Completed | 19/42 [00:39<00:45, 1.97s/it]
Loading safetensors checkpoint shards: 48% Completed | 20/42 [00:41<00:44, 2.01s/it]
Loading safetensors checkpoint shards: 50% Completed | 21/42 [00:43<00:42, 2.03s/it]
Loading safetensors checkpoint shards: 52% Completed | 22/42 [00:45<00:40, 2.02s/it]
Loading safetensors checkpoint shards: 55% Completed | 23/42 [00:47<00:37, 1.99s/it]
Loading safetensors checkpoint shards: 57% Completed | 24/42 [00:49<00:35, 1.96s/it]
Loading safetensors checkpoint shards: 60% Completed | 25/42 [00:51<00:32, 1.94s/it]
Loading safetensors checkpoint shards: 62% Completed | 26/42 [00:53<00:30, 1.88s/it]
Loading safetensors checkpoint shards: 64% Completed | 27/42 [00:54<00:28, 1.90s/it]
Loading safetensors checkpoint shards: 67% Completed | 28/42 [00:56<00:26, 1.90s/it]
Loading safetensors checkpoint shards: 69% Completed | 29/42 [00:59<00:26, 2.04s/it]
Loading safetensors checkpoint shards: 71% Completed | 30/42 [01:01<00:25, 2.11s/it]
Loading safetensors checkpoint shards: 74% Completed | 31/42 [01:03<00:23, 2.14s/it]
Loading safetensors checkpoint shards: 76% Completed | 32/42 [01:05<00:21, 2.12s/it]
Loading safetensors checkpoint shards: 79% Completed | 33/42 [01:07<00:18, 2.07s/it]
Loading safetensors checkpoint shards: 81% Completed | 34/42 [01:09<00:16, 2.08s/it]
Loading safetensors checkpoint shards: 83% Completed | 35/42 [01:12<00:15, 2.16s/it]
Loading safetensors checkpoint shards: 86% Completed | 36/42 [01:14<00:13, 2.22s/it]
Loading safetensors checkpoint shards: 88% Completed | 37/42 [01:16<00:11, 2.22s/it]
Loading safetensors checkpoint shards: 90% Completed | 38/42 [01:18<00:08, 2.15s/it]
Loading safetensors checkpoint shards: 93% Completed | 39/42 [01:20<00:06, 2.14s/it]
Loading safetensors checkpoint shards: 95% Completed | 40/42 [01:22<00:04, 2.11s/it]
Loading safetensors checkpoint shards: 98% Completed | 41/42 [01:23<00:01, 1.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:29<00:00, 3.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:29<00:00, 2.14s/it]
(EngineCore pid=77)
(EngineCore pid=77) INFO 04-20 06:10:11 [default_loader.py:384] Loading weights took 90.09 seconds
(EngineCore pid=77) INFO 04-20 06:10:11 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=77) INFO 04-20 06:10:12 [gpu_model_runner.py:4820] Model loading took 34.23 GiB memory and 91.326528 seconds
(EngineCore pid=77) INFO 04-20 06:10:12 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=77) WARNING 04-20 06:10:17 [fp8_utils.py:1185] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=12288,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore pid=77) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, …] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, …].
(EngineCore pid=77) return fn(*contiguous_args, **contiguous_kwargs)

The warning (EngineCore pid=77) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (32)... is expected and not an error. It occurs when the sequence length is less than the number of attention heads, which can happen during normal vLLM inference, especially with chunked prefill or small input batches. This warning does not indicate a malfunction and can be safely ignored unless you observe incorrect outputs or crashes. See discussion in vllm-project/vllm#37103.

Would you like more detail on why this warning appears or how to suppress it?

Hint: Mention @RunLLM in the post for followups.

@RunLLM , Would you like more detail on why this warning appears or how to suppress it?
yes

Certainly! Here’s a step-by-step explanation:

Why does this warning appear?
The warning Input tensor shape suggests potential format mismatch: seq_len (X) < num_heads (Y) is triggered in vLLM’s Flash Linear Attention (FLA) ops when the input tensor’s sequence length (number of tokens) is less than the number of attention heads. This can happen during normal inference, especially with chunked prefill or when processing short input sequences. In vLLM, this is a common and expected scenario, not an error. The warning is meant to alert developers in case there is a real tensor shape mismatch, but for Qwen3 and similar models, it is normal for seq_len < num_heads in some inference steps (e.g., when a new chunk of tokens is shorter than the number of heads) (source).
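To make the trigger condition concrete, here is a rough sketch of the heuristic (not vLLM's actual code): the check fires purely because the token dimension is smaller than the head dimension, which is normal for short prefill chunks or single decode steps.

```python
import warnings

def check_input_format(seq_len: int, num_heads: int) -> None:
    # Simplified sketch of the FLA shape heuristic: warn whenever the
    # sequence length is smaller than the number of attention heads,
    # since that *can* indicate head-first inputs -- but short chunks
    # trip it during perfectly normal inference too.
    if seq_len < num_heads:
        warnings.warn(
            "Input tensor shape suggests potential format mismatch: "
            f"seq_len ({seq_len}) < num_heads ({num_heads})"
        )

# A 16-token chunk against 32 heads trips the check, matching the log above.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_input_format(seq_len=16, num_heads=32)
print(len(caught))  # 1
```

As the sketch shows, nothing about the model is wrong when this fires; the heuristic simply cannot distinguish a genuinely head-first tensor from a short batch.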

How to suppress the warning?
The warning is generated by code copied from the flash-linear-attention library. Currently, setting os.environ["VLLM_LOGGING_LEVEL"] = "ERROR" does not suppress this warning because it is a Python UserWarning, not a log message. The vLLM team has discussed removing or reducing the verbosity of this warning in future releases, as it is generally harmless (source). For now, you can suppress all Python warnings (including this one) by adding the following at the top of your main script:

import warnings
warnings.filterwarnings("ignore", message="Input tensor shape suggests potential format mismatch")

This will hide the warning during runtime.
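To sanity-check that the filter behaves as claimed, here is a self-contained snippet (plain Python, no vLLM required). Note that `filterwarnings` matches the `message` argument as a regex anchored at the start of the warning text, so a prefix is enough, and it conveniently avoids having to escape the parentheses in the full message.

```python
import warnings

MSG = ("Input tensor shape suggests potential format mismatch: "
       "seq_len (16) < num_heads (32)")

with warnings.catch_warnings(record=True) as caught:
    # Ignore any warning whose message starts with this prefix.
    warnings.filterwarnings(
        "ignore",
        message="Input tensor shape suggests potential format mismatch",
    )
    warnings.warn(MSG)

print(len(caught))  # 0: the warning was suppressed
```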


@RunLLM , Is there any option to stop the model from thinking with Qwen/Qwen3.6-35B-A3B-FP8?

Yes, you can disable “thinking” mode in Qwen/Qwen3.6-35B-A3B-FP8 by setting "enable_thinking": false in the chat_template_kwargs field of your API request payload. This works for Qwen3.5/3.6 models in vLLM (source).
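For illustration, a minimal request payload might look like the following sketch. The model name (`ranjith-model`) and port (`9122`) are taken from the compose file above; the question text is just a placeholder, and actually sending the request requires the vLLM server to be running.

```python
import json

# Chat completion payload with thinking disabled via chat_template_kwargs.
# POST this body to http://localhost:9122/v1/chat/completions with the
# header "Content-Type: application/json" (e.g. via curl or requests).
payload = {
    "model": "ranjith-model",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
print(body)
```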

Would you like a code example or more detail?


@RunLLM But setting "enable_thinking": false is not stopping the model from thinking; it just stops the thinking tokens from being streamed in the output.

Setting "enable_thinking": false in chat_template_kwargs for Qwen3.5/3.6 models disables the generation of reasoning/thinking content—the model will not produce or output any thinking tokens, and only the final answer is generated (docs.vllm.ai reasoning outputs, vllm-project/vllm#35574). If you still see reasoning content, ensure you are using the latest vLLM version and that the model/chat template supports this flag.

Would you like troubleshooting steps or a sample request?
