@RunLLM, I am getting the warning below while serving the Qwen/Qwen3.6-35B-A3B-FP8 model with vLLM v0.19.0.
Docker Compose file used:
services:
  qwen3.6:
    image: vllm/vllm-openai:v0.19.0
    container_name: qwen3.6
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.9
      --max_model_len 32768
      --max_num_batched_tokens 4096
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
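For reference, once the container is up, the server is reachable on host port 9122 (mapped to 8000 in the container) and is queried through vLLM's OpenAI-compatible API. A minimal sketch of such a request (the model name matches --served-model-name above; the prompt is just an example):

curl http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ranjith-model",
        "messages": [{"role": "user", "content": "Hello"}]
      }'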
Logs:
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299]
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] █▄█▀ █ █ █ █ model /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:299]
(APIServer pid=1) INFO 04-20 06:08:21 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', 'max_model_len': 32768, 'enforce_eager': True, 'served_model_name': ['ranjith-model'], 'max_num_batched_tokens': 4096}
(APIServer pid=1) INFO 04-20 06:08:21 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 04-20 06:08:21 [model.py:1678] Using max model len 32768
(APIServer pid=1) INFO 04-20 06:08:21 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 04-20 06:08:21 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-20 06:08:21 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 04-20 06:08:21 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-20 06:08:21 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-20 06:08:21 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-20 06:08:21 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 04-20 06:08:21 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=77) INFO 04-20 06:08:34 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ranjith-model, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=77) INFO 04-20 06:08:35 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.2:45789 backend=nccl
(EngineCore pid=77) INFO 04-20 06:08:35 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=77) INFO 04-20 06:08:39 [gpu_model_runner.py:4735] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen3.6-35B-A3B-FP8/snapshots/61a5771f218894aaacf97551e24a25b866750fc2...
(EngineCore pid=77) INFO 04-20 06:08:40 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=77) INFO 04-20 06:08:40 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=77) INFO 04-20 06:08:40 [__init__.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=77) INFO 04-20 06:08:40 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(EngineCore pid=77) INFO 04-20 06:08:40 [fp8.py:396] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore pid=77) INFO 04-20 06:08:41 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=77) INFO 04-20 06:08:41 [flash_attn.py:596] Using FlashAttention version 2
Loading safetensors checkpoint shards: 0% Completed | 0/42 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/42 [00:02<01:23, 2.04s/it]
Loading safetensors checkpoint shards: 5% Completed | 2/42 [00:04<01:24, 2.11s/it]
Loading safetensors checkpoint shards: 7% Completed | 3/42 [00:06<01:22, 2.12s/it]
Loading safetensors checkpoint shards: 10% Completed | 4/42 [00:08<01:20, 2.12s/it]
Loading safetensors checkpoint shards: 12% Completed | 5/42 [00:10<01:18, 2.12s/it]
Loading safetensors checkpoint shards: 14% Completed | 6/42 [00:12<01:15, 2.09s/it]
Loading safetensors checkpoint shards: 17% Completed | 7/42 [00:14<01:14, 2.13s/it]
Loading safetensors checkpoint shards: 19% Completed | 8/42 [00:17<01:14, 2.20s/it]
Loading safetensors checkpoint shards: 21% Completed | 9/42 [00:19<01:14, 2.26s/it]
Loading safetensors checkpoint shards: 24% Completed | 10/42 [00:21<01:10, 2.19s/it]
Loading safetensors checkpoint shards: 26% Completed | 11/42 [00:23<01:06, 2.14s/it]
Loading safetensors checkpoint shards: 29% Completed | 12/42 [00:25<01:03, 2.11s/it]
Loading safetensors checkpoint shards: 31% Completed | 13/42 [00:27<00:59, 2.06s/it]
Loading safetensors checkpoint shards: 33% Completed | 14/42 [00:29<00:58, 2.08s/it]
Loading safetensors checkpoint shards: 36% Completed | 15/42 [00:31<00:55, 2.05s/it]
Loading safetensors checkpoint shards: 38% Completed | 16/42 [00:33<00:51, 1.99s/it]
Loading safetensors checkpoint shards: 40% Completed | 17/42 [00:35<00:49, 1.98s/it]
Loading safetensors checkpoint shards: 43% Completed | 18/42 [00:37<00:47, 1.96s/it]
Loading safetensors checkpoint shards: 45% Completed | 19/42 [00:39<00:45, 1.97s/it]
Loading safetensors checkpoint shards: 48% Completed | 20/42 [00:41<00:44, 2.01s/it]
Loading safetensors checkpoint shards: 50% Completed | 21/42 [00:43<00:42, 2.03s/it]
Loading safetensors checkpoint shards: 52% Completed | 22/42 [00:45<00:40, 2.02s/it]
Loading safetensors checkpoint shards: 55% Completed | 23/42 [00:47<00:37, 1.99s/it]
Loading safetensors checkpoint shards: 57% Completed | 24/42 [00:49<00:35, 1.96s/it]
Loading safetensors checkpoint shards: 60% Completed | 25/42 [00:51<00:32, 1.94s/it]
Loading safetensors checkpoint shards: 62% Completed | 26/42 [00:53<00:30, 1.88s/it]
Loading safetensors checkpoint shards: 64% Completed | 27/42 [00:54<00:28, 1.90s/it]
Loading safetensors checkpoint shards: 67% Completed | 28/42 [00:56<00:26, 1.90s/it]
Loading safetensors checkpoint shards: 69% Completed | 29/42 [00:59<00:26, 2.04s/it]
Loading safetensors checkpoint shards: 71% Completed | 30/42 [01:01<00:25, 2.11s/it]
Loading safetensors checkpoint shards: 74% Completed | 31/42 [01:03<00:23, 2.14s/it]
Loading safetensors checkpoint shards: 76% Completed | 32/42 [01:05<00:21, 2.12s/it]
Loading safetensors checkpoint shards: 79% Completed | 33/42 [01:07<00:18, 2.07s/it]
Loading safetensors checkpoint shards: 81% Completed | 34/42 [01:09<00:16, 2.08s/it]
Loading safetensors checkpoint shards: 83% Completed | 35/42 [01:12<00:15, 2.16s/it]
Loading safetensors checkpoint shards: 86% Completed | 36/42 [01:14<00:13, 2.22s/it]
Loading safetensors checkpoint shards: 88% Completed | 37/42 [01:16<00:11, 2.22s/it]
Loading safetensors checkpoint shards: 90% Completed | 38/42 [01:18<00:08, 2.15s/it]
Loading safetensors checkpoint shards: 93% Completed | 39/42 [01:20<00:06, 2.14s/it]
Loading safetensors checkpoint shards: 95% Completed | 40/42 [01:22<00:04, 2.11s/it]
Loading safetensors checkpoint shards: 98% Completed | 41/42 [01:23<00:01, 1.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:29<00:00, 3.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:29<00:00, 2.14s/it]
(EngineCore pid=77)
(EngineCore pid=77) INFO 04-20 06:10:11 [default_loader.py:384] Loading weights took 90.09 seconds
(EngineCore pid=77) INFO 04-20 06:10:11 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=77) INFO 04-20 06:10:12 [gpu_model_runner.py:4820] Model loading took 34.23 GiB memory and 91.326528 seconds
(EngineCore pid=77) INFO 04-20 06:10:12 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=77) WARNING 04-20 06:10:17 [fp8_utils.py:1185] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=12288,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore pid=77) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore pid=77) return fn(*contiguous_args, **contiguous_kwargs)