I have 8 L20 GPUs. Why does running glm5-FP8 fail with an error?

(vllm_glm) solution@wa5480-g3-01:~/vllm_glm$ vllm serve glm \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8 \
  --port 8888
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287]
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287] █ █ █▄ ▄█
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.0rc2.dev123+gec12d39d4
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287] █▄█▀ █ █ █ █ model glm
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:287]
(APIServer pid=55145) INFO 03-19 16:39:58 [utils.py:223] non-default args: {'model_tag': 'glm', 'port': 8888, 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': 'glm', 'served_model_name': ['glm-5-fp8'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 8, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 1}}
(APIServer pid=55145) INFO 03-19 16:39:58 [model.py:531] Resolved architecture: GlmMoeDsaForCausalLM
(APIServer pid=55145) INFO 03-19 16:39:58 [model.py:1555] Using max model len 202752
(APIServer pid=55145) INFO 03-19 16:39:58 [model.py:531] Resolved architecture: DeepSeekMTPModel
(APIServer pid=55145) INFO 03-19 16:39:58 [model.py:1555] Using max model len 202752
(APIServer pid=55145) INFO 03-19 16:39:58 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=55145) INFO 03-19 16:39:58 [vllm.py:698] Asynchronous scheduling is enabled.
(APIServer pid=55145) INFO 03-19 16:39:58 [cuda.py:248] Forcing kv cache block size to 64 for FlashMLASparse backend.
(APIServer pid=55145) The following generation flags are not valid and may be ignored: ['top_p']. Set TRANSFORMERS_VERBOSITY=info for more details.
(EngineCore_DP0 pid=55416) INFO 03-19 16:40:05 [core.py:97] Initializing a V1 LLM engine (v0.16.0rc2.dev123+gec12d39d4) with config: model='glm', speculative_config=SpeculativeConfig(method='mtp', model='glm', num_spec_tokens=1), tokenizer='glm', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=202752, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=glm-5-fp8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': , 'compile_ranges_split_points': [2304], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': }
(EngineCore_DP0 pid=55416) WARNING 03-19 16:40:05 [multiproc_executor.py:921] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-19 16:40:10 [parallel_state.py:1246] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:15 [parallel_state.py:1246] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:20 [parallel_state.py:1246] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:25 [parallel_state.py:1246] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:29 [parallel_state.py:1246] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:34 [parallel_state.py:1246] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:39 [parallel_state.py:1246] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:44 [parallel_state.py:1246] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:48043 backend=nccl
INFO 03-19 16:40:44 [pynccl.py:111] vLLM is using nccl==2.28.9
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-19 16:40:44 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 4, EP rank 4, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 6, EP rank 6, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 7, EP rank 7, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 5, EP rank 5, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A
INFO 03-19 16:40:44 [parallel_state.py:1474] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
WARNING 03-19 16:40:45 [__init__.py:204] min_p, logit_bias, and min_tokens parameters won't currently work with speculative decoding enabled.
(Worker_TP0 pid=55641) INFO 03-19 16:40:45 [gpu_model_runner.py:4124] Starting to load model glm…
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] WorkerProc failed to start.
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] WorkerProc failed to start.
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] Traceback (most recent call last):
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] Traceback (most recent call last):
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] worker = WorkerProc(*args, **kwargs)
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 580, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 580, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.worker.load_model()
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.worker.load_model()
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 294, in load_model
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 294, in load_model
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4143, in load_model
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4143, in load_model
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model = model_loader.load_model(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model = model_loader.load_model(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] model = initialize_model(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] model = initialize_model(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 54, in initialize_model
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 54, in initialize_model
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1210, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1210, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model = self.model_cls(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.model = self.model_cls(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] old_init(self, **kwargs)
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] old_init(self, **kwargs)
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1067, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1067, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 707, in make_layers
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 707, in make_layers
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1069, in <lambda>
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 1069, in <lambda>
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] lambda prefix: DeepseekV2DecoderLayer(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] lambda prefix: DeepseekV2DecoderLayer(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 941, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 941, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.self_attn = attn_cls(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.self_attn = attn_cls(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 875, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 875, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.mla_attn = MultiHeadLatentAttentionWrapper(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.mla_attn = MultiHeadLatentAttentionWrapper(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/mla.py", line 95, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/mla.py", line 95, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.mla_attn = MLAAttention(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.mla_attn = MLAAttention(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/attention/mla_attention.py", line 330, in __init__
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/attention/mla_attention.py", line 330, in __init__
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.attn_backend = get_attn_backend(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] self.attn_backend = get_attn_backend(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/attention/selector.py", line 83, in get_attn_backend
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/attention/selector.py", line 83, in get_attn_backend
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] return _cached_get_attn_backend(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] return _cached_get_attn_backend(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/attention/selector.py", line 96, in _cached_get_attn_backend
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/v1/attention/selector.py", line 96, in _cached_get_attn_backend
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] attention_cls = current_platform.get_attn_backend_cls(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] attention_cls = current_platform.get_attn_backend_cls(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 354, in get_attn_backend_cls
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] File "/home/solution/vllm_glm/.venv/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 354, in get_attn_backend_cls
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] raise ValueError(
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] raise ValueError(
(Worker_TP5 pid=55823) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported, FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
(Worker_TP2 pid=55719) ERROR 03-19 16:40:47 [multiproc_executor.py:783] ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported, FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
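
The `Reasons: {...}` part of the final ValueError is hard to read on one line, so here is a minimal sketch that splits it per backend. It only parses the text of the message copied from the log; `parse_backend_reasons` is a hypothetical helper written for this post, not a vLLM function, and the regular expressions assume the message keeps the `NAME: [reason, ...]` layout seen in the log.

```python
import re

# The "Reasons" part of the ValueError, copied from the log in this post.
ERROR_TEXT = (
    "Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not "
    "supported, FlashAttention MLA not supported on this device], FLASHMLA: "
    "[sparse not supported, compute capability not supported, FlashMLA Sparse "
    "is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: "
    "[sparse not supported, compute capability not supported, FlashInfer MLA "
    "kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: "
    "[sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}."
)


def parse_backend_reasons(message: str) -> dict[str, str]:
    """Map each backend name to the raw text inside its [...] reason list.

    Hypothetical helper: assumes the 'NAME: [reason, ...]' layout from the log.
    The reasons are kept as one string because some of them contain commas.
    """
    match = re.search(r"Reasons: \{(.*)\}", message, flags=re.DOTALL)
    if match is None:
        return {}
    return dict(re.findall(r"(\w+): \[([^\]]*)\]", match.group(1)))


reasons = parse_backend_reasons(ERROR_TEXT)

# Backends that did not reject the request because of use_sparse=True.
sparse_capable = [
    name for name, text in reasons.items() if "sparse not supported" not in text
]

# Of those, the ones rejected because of the GPU's compute capability.
blocked_by_cc = [
    name
    for name in sparse_capable
    if "compute capability not supported" in reasons[name]
]

for name, text in reasons.items():
    print(f"{name}: {text}")
print("accepts sparse:", sparse_capable)
print("rejected for compute capability:", blocked_by_cc)
```

Read this way, the message says that only FLASHMLA_SPARSE accepts the `use_sparse=True` request, and that backend is rejected for compute capability. The SymmMemCommunicator warnings earlier in the log report capability 8.9 for these GPUs, and the FLASHMLA entry mentions Hopper and Blackwell devices only, so my guess is that the sparse attention path of this model is not available on L20 in this vLLM build.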