RuntimeError: Int8 not supported on SM120. Use FP8 quantization instead, or run on older arch (SM < 100)

When I used the GPTQ method from the llmcompressor library to perform INT8 quantization on Qwen3-VL-4B and then ran inference with vLLM 0.11.0 on an RTX 5090, the following error occurred: RuntimeError: Int8 not supported for this architecture.
However, the same setup works fine on an RTX 4090.
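
For reference, a minimal sketch of the workflow described above, assuming llmcompressor's `oneshot` API with `GPTQModifier(scheme="W8A8")`; the model id, calibration dataset, sample counts, and output path are placeholders, and a vision-language model like Qwen3-VL may need additional modules (e.g. the vision tower) excluded from quantization:

```python
# Sketch: GPTQ W8A8 (INT8) quantization with llmcompressor, then serving with vLLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"   # placeholder model id
SAVE_DIR = "Qwen3-VL-4B-W8A8"            # placeholder output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ requires calibration data; the dataset and sample count here are examples.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir=SAVE_DIR,
)
tokenizer.save_pretrained(SAVE_DIR)

# Inference with vLLM 0.11.0 — this is the step that raises
# "RuntimeError: Int8 not supported ..." on an RTX 5090 (SM 120).
from vllm import LLM
llm = LLM(model=SAVE_DIR)
print(llm.generate("Hello")[0].outputs[0].text)
```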

vLLM's current INT8 quantization (W8A8) is not supported on the RTX 5090 (Blackwell architecture, SM 120), which is why you get "RuntimeError: Int8 not supported for this architecture". It is supported on the RTX 4090 (Ada architecture, SM 89), so it runs normally there. On Blackwell you need to use FP8 quantization; INT8 support has not been implemented yet. See the related issues linked from the official documentation.

To run a quantized model on an RTX 5090, use an FP8-quantized model, or wait for a future vLLM release to add INT8 support for the Blackwell architecture. The official guidance is explicit that INT8 is not supported on Blackwell, so switching quantization schemes is the recommended path.
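
A minimal sketch of the FP8 alternative, assuming llmcompressor's `QuantizationModifier` with the `FP8_DYNAMIC` scheme (which needs no calibration data); the model id and output path are placeholders:

```python
# Sketch: FP8 dynamic quantization with llmcompressor, then loading in vLLM on Blackwell.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"   # placeholder model id
SAVE_DIR = "Qwen3-VL-4B-FP8-Dynamic"     # placeholder output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic activation quantization: weights quantized offline, activations at runtime.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

# The resulting FP8 checkpoint should load on an RTX 5090 (SM 120).
from vllm import LLM
llm = LLM(model=SAVE_DIR)
```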

Would you like a more detailed technical analysis or alternative options?
