Ray cluster DeepSeek-R1-Distill-Qwen-32B-AWQ

Hi,
I have 4 computers, each with 32 GB of RAM and one 2080 Ti with 11 GB of memory on board.
As an experiment, I'm trying to deploy DeepSeek-R1-Distill-Qwen-32B-AWQ using a Ray cluster.
Resources
Total Usage:
0.0/80.0 CPU
0.0/4.0 GPU
0B/85.30GiB memory
0B/36.56GiB object_store_memory
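
For reference, the same numbers can be checked programmatically from any node; a minimal sketch (assuming the cluster shown above is already running):

    import ray

    # Attach to the existing Ray cluster instead of starting a new one.
    ray.init(address="auto")
    print(ray.cluster_resources())     # expect {'CPU': 80.0, 'GPU': 4.0, ...}
    print(ray.available_resources())   # what is currently free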
The command I use to launch it:
vllm serve /mnt/nfs/models/DeepSeek-R1-Distill-Qwen-32B-AWQ/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --gpu-memory-utilization 0.6 --max-model-len 4096 --max-num-seqs 8 --block-size 16 --max-num-batched-tokens 4096

2025-07-18 13:45:14,972 INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 192.168.13.140:6379...
2025-07-18 13:45:14,980 INFO worker.py:1917 -- Connected to Ray cluster.
INFO 07-18 13:45:15 [ray_utils.py:334] No current placement group found. Creating a new placement group.
INFO 07-18 13:45:15 [ray_distributed_executor.py:177] use_ray_spmd_worker: False
(pid=24128) INFO 07-18 13:45:18 [__init__.py:244] Automatically detected platform cuda.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:31] No plugins for group vllm.platform_plugins found.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:35] Checking if TPU platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:52] Checking if CUDA platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:72] Confirmed CUDA platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:100] Checking if ROCm platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:121] Checking if HPU platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:138] Checking if XPU platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:155] Checking if CPU platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:177] Checking if Neuron platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:52] Checking if CUDA platform is available.
(pid=55428, ip=192.168.13.142) DEBUG 07-18 13:45:18 [__init__.py:72] Confirmed CUDA platform is available.
INFO 07-18 13:45:19 [ray_distributed_executor.py:353] non_carry_over_env_vars from config: set()
INFO 07-18 13:45:19 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 07-18 13:45:19 [ray_distributed_executor.py:358] If certain env vars should NOT be copied to workers, add them to /home/deepseek-fb/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 07-18 13:45:19 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-18 13:45:19 [cuda.py:360] Using XFormers backend.
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:19 [__init__.py:39] Available plugins for group vllm.general_plugins:
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:19 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:19 [__init__.py:44] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(RayWorkerWrapper pid=64302, ip=192.168.13.143) INFO 07-18 13:45:19 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=64302, ip=192.168.13.143) INFO 07-18 13:45:19 [cuda.py:360] Using XFormers backend.
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:19 [config.py:4834] enabled custom ops: Counter()
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:19 [config.py:4836] disabled custom ops: Counter()
(RayWorkerWrapper pid=64302, ip=192.168.13.143) DEBUG 07-18 13:45:20 [parallel_state.py:919] world_size=4 rank=3 local_rank=0 distributed_init_method=tcp://192.168.13.140:43931 backend=nccl
Perhaps the model is just too big?
Or do I have a problem with NCCL?
I was able to get a deployment attempt running on two nodes, but I can't get it to work with 4.
Thanks in advance for the answer.
RunLLM bot, you're the best!


Your hardware (4x 2080 Ti, 11 GB VRAM each) is not enough for DeepSeek-R1-Distill-Qwen-32B-AWQ, which needs far more GPU memory headroom than you have. An unquantized 32B model needs roughly 64 GB in FP16 for the weights alone, and even the 4-bit AWQ version is around 18-20 GB before KV cache, so 32B models are normally served on high-memory GPUs (40 GB+). Even with pipeline parallelism, vLLM struggles to fit such a model on 4x 11 GB GPUs. The logs show XFormers is used (not FlashAttention-2, which is faster), but the main issue is GPU memory, not NCCL or the Ray setup. According to the vLLM distributed serving docs, every node must have enough GPU memory to hold its partition of the model, and 11 GB leaves very little margin for a 32B model.
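
For a rough sense of the numbers, a back-of-envelope sketch (approximate figures only; 0.5 bytes/parameter assumes 4-bit AWQ weights, and the 1.2 factor is a guessed allowance for scales, embeddings and runtime overhead):

    # Approximate per-GPU budget for a 32B AWQ model split over 4 pipeline stages.
    params = 32e9
    weights_gb = params * 0.5 / 1e9 * 1.2   # ~19 GB of quantized weights incl. overhead
    per_stage_gb = weights_gb / 4           # ~4.8 GB of weights per 2080 Ti
    budget_gb = 11 * 0.6                    # --gpu-memory-utilization 0.6 of 11 GB -> 6.6 GB
    print(budget_gb - per_stage_gb)         # only ~1-2 GB left for KV cache, activations, CUDA graphs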

If you want to experiment, try a much smaller model (e.g., 7B or 13B) or use 32B models only on clusters with high-memory GPUs (A100/H100/H800, 40GB+). For more details on hardware requirements and troubleshooting, see vLLM troubleshooting.

Would you like a step-by-step breakdown or advice on running smaller models?


Thanks for the answer.
I chose a lighter model, DeepSeek-R1-0528-Qwen3-8B. Why exactly this one?
I run it the same way:

vllm serve /mnt/nfs/models/DeepSeek-R1-0528-Qwen3-8B/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --gpu-memory-utilization 0.6 --max-model-len 4096 --max-num-seqs 8 --max-num-batched-tokens 4096 --block-size 16
After this log output the process just hangs in memory; nothing else happens.
INFO 07-21 07:52:18 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 07:52:20 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-21 07:52:20 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': '/mnt/nfs/models/DeepSeek-R1-0528-Qwen3-8B/', 'max_model_len': 4096, 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'block_size': 16, 'gpu_memory_utilization': 0.6, 'max_num_batched_tokens': 4096, 'max_num_seqs': 8}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}
INFO 07-21 07:52:24 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 07-21 07:52:24 [config.py:3320] Your device 'NVIDIA GeForce RTX 2080 Ti' (with compute capability 7.5) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 07-21 07:52:24 [config.py:3371] Casting torch.bfloat16 to torch.float16.
INFO 07-21 07:52:24 [config.py:1472] Using max model len 4096
WARNING 07-21 07:52:24 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 07-21 07:52:24 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/nfs/models/DeepSeek-R1-0528-Qwen3-8B/', speculative_config=None, tokenizer='/mnt/nfs/models/DeepSeek-R1-0528-Qwen3-8B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nfs/models/DeepSeek-R1-0528-Qwen3-8B/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2": false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":8,"local_cache_dir":null}, use_cached_outputs=False,
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}
2025-07-21 07:52:25,231 INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 192.168.13.140:6379...
2025-07-21 07:52:25,239 INFO worker.py:1917 -- Connected to Ray cluster.
INFO 07-21 07:52:25 [ray_utils.py:334] No current placement group found. Creating a new placement group.
INFO 07-21 07:52:25 [ray_distributed_executor.py:177] use_ray_spmd_worker: False
(pid=27238) INFO 07-21 07:52:28 [__init__.py:244] Automatically detected platform cuda.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:31] No plugins for group vllm.platform_plugins found.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:35] Checking if TPU platform is available.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:52] Checking if CUDA platform is available.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:72] Confirmed CUDA platform is available.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:100] Checking if ROCm platform is available.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28 [__init__.py:121] Checking if HPU platform is available.
(pid=72091, ip=192.168.13.143) DEBUG 07-21 07:52:28

Your process hangs because the 2080 Ti (11 GB VRAM) is still a very tight fit for DeepSeek-R1-0528-Qwen3-8B, even with pipeline parallelism across 4 nodes. An 8B model in FP16 needs roughly 16 GB for the weights alone, and vLLM's V0 engine (used on compute capability <8.0 GPUs such as the 2080 Ti) is less memory-efficient and slower. The logs confirm the fallback to V0 and show no errors, but the model likely cannot fit in memory, causing the hang during model loading or initialization. This points to a hardware limitation rather than a software or NCCL issue.

For successful deployment, use a smaller model (e.g., 3B or 7B) or upgrade to GPUs with more VRAM (A6000, A100, H100, etc.). See vLLM troubleshooting and distributed serving docs for more details.
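
If you want to rule out a memory problem before digging further, here is a small diagnostic sketch (my own, not from the vLLM docs) that asks every GPU node in the Ray cluster how much VRAM is actually free:

    import ray

    ray.init(address="auto")

    @ray.remote(num_gpus=1)
    def gpu_mem():
        import socket, torch
        free, total = torch.cuda.mem_get_info()   # bytes on the assigned GPU
        return socket.gethostname(), round(free / 2**30, 1), round(total / 2**30, 1)

    # One task per GPU (one per node here); prints (host, free GiB, total GiB).
    print(ray.get([gpu_mem.remote() for _ in range(4)]))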

Would you like a step-by-step guide for running a smaller model?


Now I have chosen the model gemma-3n-E2B-it.
I think it will definitely fit.
Could you tell me the command and flags to launch it?

To launch Gemma-3n-E2B-it with vLLM, use the following command (replace the model path if needed):

vllm serve google/gemma-3n-E2B-it --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --max-model-len 1024 --dtype bfloat16

This uses the recommended settings for Gemma-3n-E2B-it, including a context length of 1024 (due to backend limitations on 2080TI) and bfloat16 for best performance. For more, see Gemma3n (Text-only) PR.

Would you like a step-by-step explanation or details on each parameter?


Here are the launch logs.
What went wrong?

vllm serve /mnt/nfs/models/gemma-3n-E2B-it --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --max-model-len 1024 --dtype bfloat16
INFO 07-21 09:28:07 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 09:28:09 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-21 09:28:09 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': '/mnt/nfs/models/gemma-3n-E2B-it', 'dtype': 'bfloat16', 'max_model_len': 1024, 'gpu_memory_utilization': 0.95}
INFO 07-21 09:28:13 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 07-21 09:28:13 [config.py:1472] Using max model len 1024
WARNING 07-21 09:28:13 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. 
INFO 07-21 09:28:14 [api_server.py:268] Started engine process with PID 37069
INFO 07-21 09:28:17 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 09:28:18 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/nfs/models/gemma-3n-E2B-it', speculative_config=None, tokenizer='/mnt/nfs/models/gemma-3n-E2B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nfs/models/gemma-3n-E2B-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True, 
INFO 07-21 09:28:20 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-21 09:28:20 [cuda.py:360] Using XFormers backend.
ERROR 07-21 09:28:20 [engine.py:458] Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
ERROR 07-21 09:28:20 [engine.py:458] Traceback (most recent call last):
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-21 09:28:20 [engine.py:458]     engine = MQLLMEngine.from_vllm_config(
ERROR 07-21 09:28:20 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-21 09:28:20 [engine.py:458]     return cls(
ERROR 07-21 09:28:20 [engine.py:458]            ^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-21 09:28:20 [engine.py:458]     self.engine = LLMEngine(*args, **kwargs)
ERROR 07-21 09:28:20 [engine.py:458]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
ERROR 07-21 09:28:20 [engine.py:458]     self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 07-21 09:28:20 [engine.py:458]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-21 09:28:20 [engine.py:458]     self._init_executor()
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 07-21 09:28:20 [engine.py:458]     self.collective_rpc("init_device")
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-21 09:28:20 [engine.py:458]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-21 09:28:20 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-21 09:28:20 [engine.py:458]     return func(*args, **kwargs)
ERROR 07-21 09:28:20 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 606, in init_device
ERROR 07-21 09:28:20 [engine.py:458]     self.worker.init_device()  # type: ignore
ERROR 07-21 09:28:20 [engine.py:458]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 184, in init_device
ERROR 07-21 09:28:20 [engine.py:458]     _check_if_gpu_supports_dtype(self.model_config.dtype)
ERROR 07-21 09:28:20 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 553, in _check_if_gpu_supports_dtype
ERROR 07-21 09:28:20 [engine.py:458]     raise ValueError(
ERROR 07-21 09:28:20 [engine.py:458] ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
    raise e from None
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
    return cls(
           ^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("init_device")
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 606, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 184, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 553, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Traceback (most recent call last):
  File "/home/deepseek-fb/vllm_env/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.


Your error is: "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the dtype flag in CLI, for example: --dtype=half." This means your GPU does not support bfloat16, so vLLM cannot start with that data type.

Solution: run the command with --dtype=half instead of --dtype=bfloat16:

vllm serve /mnt/nfs/models/gemma-3n-E2B-it --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --max-model-len 1024 --dtype=half
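
If you are unsure which dtype a given GPU supports, a quick PyTorch check (a minimal sketch):

    import torch

    major, minor = torch.cuda.get_device_capability(0)   # a 2080 Ti reports (7, 5)
    print(f"compute capability {major}.{minor}, bf16 supported: {torch.cuda.is_bf16_supported()}")
    print("use --dtype=" + ("bfloat16" if major >= 8 else "half"))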

Would you like a more detailed explanation of why this happens and how dtype selection works?


vllm serve /mnt/nfs/models/gemma-3n-E2B-it --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --max-model-len 1024 --dtype=half
INFO 07-21 09:30:54 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 09:30:56 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-21 09:30:56 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': '/mnt/nfs/models/gemma-3n-E2B-it', 'dtype': 'half', 'max_model_len': 1024, 'gpu_memory_utilization': 0.95}
INFO 07-21 09:31:01 [config.py:841] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 07-21 09:31:01 [config.py:3371] Casting torch.bfloat16 to torch.float16.
INFO 07-21 09:31:01 [config.py:1472] Using max model len 1024
WARNING 07-21 09:31:01 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. 
INFO 07-21 09:31:01 [api_server.py:268] Started engine process with PID 37237
INFO 07-21 09:31:04 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 09:31:06 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/nfs/models/gemma-3n-E2B-it', speculative_config=None, tokenizer='/mnt/nfs/models/gemma-3n-E2B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nfs/models/gemma-3n-E2B-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True, 
INFO 07-21 09:31:07 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-21 09:31:07 [cuda.py:360] Using XFormers backend.
INFO 07-21 09:31:07 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-21 09:31:07 [model_runner.py:1171] Starting to load model /mnt/nfs/models/gemma-3n-E2B-it...
ERROR 07-21 09:31:09 [engine.py:458] KV sharing is not supported in V0.
ERROR 07-21 09:31:09 [engine.py:458] Traceback (most recent call last):
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-21 09:31:09 [engine.py:458]     engine = MQLLMEngine.from_vllm_config(
ERROR 07-21 09:31:09 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-21 09:31:09 [engine.py:458]     return cls(
ERROR 07-21 09:31:09 [engine.py:458]            ^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.engine = LLMEngine(*args, **kwargs)
ERROR 07-21 09:31:09 [engine.py:458]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 07-21 09:31:09 [engine.py:458]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self._init_executor()
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-21 09:31:09 [engine.py:458]     self.collective_rpc("load_model")
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-21 09:31:09 [engine.py:458]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-21 09:31:09 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-21 09:31:09 [engine.py:458]     return func(*args, **kwargs)
ERROR 07-21 09:31:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 210, in load_model
ERROR 07-21 09:31:09 [engine.py:458]     self.model_runner.load_model()
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1174, in load_model
ERROR 07-21 09:31:09 [engine.py:458]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 07-21 09:31:09 [engine.py:458]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
ERROR 07-21 09:31:09 [engine.py:458]     return loader.load_model(vllm_config=vllm_config,
ERROR 07-21 09:31:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 07-21 09:31:09 [engine.py:458]     model = initialize_model(vllm_config=vllm_config,
ERROR 07-21 09:31:09 [engine.py:458]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
ERROR 07-21 09:31:09 [engine.py:458]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 07-21 09:31:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 774, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.model = Gemma3nModel(vllm_config=vllm_config,
ERROR 07-21 09:31:09 [engine.py:458]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 737, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
ERROR 07-21 09:31:09 [engine.py:458]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 152, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 579, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 07-21 09:31:09 [engine.py:458]                                                     ^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 640, in make_layers
ERROR 07-21 09:31:09 [engine.py:458]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 07-21 09:31:09 [engine.py:458]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
ERROR 07-21 09:31:09 [engine.py:458]     lambda prefix: Gemma3nDecoderLayer(
ERROR 07-21 09:31:09 [engine.py:458]                    ^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 389, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.self_attn = Gemma3nAttention(
ERROR 07-21 09:31:09 [engine.py:458]                      ^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 331, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.attn = Attention(
ERROR 07-21 09:31:09 [engine.py:458]                 ^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/attention/layer.py", line 140, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 07-21 09:31:09 [engine.py:458]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-21 09:31:09 [engine.py:458]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 397, in __init__
ERROR 07-21 09:31:09 [engine.py:458]     raise NotImplementedError("KV sharing is not supported in V0.")
ERROR 07-21 09:31:09 [engine.py:458] NotImplementedError: KV sharing is not supported in V0.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
    raise e from None
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
    return cls(
           ^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
    self.collective_rpc("load_model")
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 210, in load_model
    self.model_runner.load_model()
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1174, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
    return loader.load_model(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
    model = initialize_model(vllm_config=vllm_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 774, in __init__
    self.model = Gemma3nModel(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 737, in __init__
    self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 152, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 579, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 640, in make_layers
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
    lambda prefix: Gemma3nDecoderLayer(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 389, in __init__
    self.self_attn = Gemma3nAttention(
                     ^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/gemma3n.py", line 331, in __init__
    self.attn = Attention(
                ^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/attention/layer.py", line 140, in __init__
    self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 397, in __init__
    raise NotImplementedError("KV sharing is not supported in V0.")
NotImplementedError: KV sharing is not supported in V0.
[rank0]:[W721 09:31:09.631105712 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/deepseek-fb/vllm_env/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

The error:
NotImplementedError: KV sharing is not supported in V0.
This means that the gemma-3n-E2B-it model requires the V1 engine for KV-sharing support, while your GPU (2080 Ti, compute capability 7.5) is not supported by V1, so vLLM automatically falls back to V0, where this feature is not implemented. That is why it cannot be launched on your hardware.

Solution:
On a 2080 Ti you will not be able to run the modern Gemma3n models (or anything else that requires V1/KV sharing) through vLLM. Use older or simpler models that do not require V1, or run Gemma3n on GPUs with compute capability 8.0+ (for example A100, H100, RTX 4090 and newer). See the linked discussion and hardware requirements for details.

Would you like a list of models that are guaranteed to work on the 2080 Ti?


Yes, of course, please send the list of models.

It looks like the only models guaranteed to work on an NVIDIA 2080 Ti (compute capability 7.5) are those that do not require the V1 engine and do not use KV sharing. That mostly means Llama 2 and Llama 3 (8B, 13B), Mistral 7B, Mixtral 8x7B, Qwen2.5-7B, Qwen2.5-14B, Phi-2, Phi-3-mini, Falcon 7B, TinyLlama, as well as most models up to roughly 13B parameters without the newest optimizations. It is important to launch them with --dtype=half.

Example models and commands:

  • Llama 2 7B/13B:
    vllm serve meta-llama/Llama-2-7b-chat-hf --dtype=half
    
  • Mistral 7B:
    vllm serve mistralai/Mistral-7B-Instruct-v0.2 --dtype=half
    
  • Mixtral 8x7B:
    vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --dtype=half
    
  • Qwen2.5-7B:
    vllm serve Qwen/Qwen2.5-7B-Instruct --dtype=half
    
  • Phi-2:
    vllm serve microsoft/phi-2 --dtype=half
    
  • TinyLlama:
    vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half
    

A more complete list of supported models for your hardware can be found in the official supported-models table: look for models without a "V1 only" note and without a compute capability 8.0+ requirement.
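
Before exposing anything over HTTP, it can also be easier to confirm that a model fits using vLLM's offline Python API; here is a minimal sketch with the smallest model from the list above (the prompt and sampling settings are just placeholders):

    from vllm import LLM, SamplingParams

    # Single-GPU sanity check on one 2080 Ti; float16 is required on compute capability 7.5.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              dtype="half",
              max_model_len=2048,
              gpu_memory_utilization=0.9)
    out = llm.generate(["Hello, who are you?"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)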

Do you need instructions for launching a specific model?



I'm trying one of the suggested models.
Maybe I'm using the wrong flags?

vllm serve /mnt/nfs/models/Qwen2.5-VL-7B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 8 --max-num-batched-tokens 4096 --block-size 16 --dtype=half
INFO 07-21 11:38:25 [__init__.py:244] Automatically detected platform cuda.
INFO 07-21 11:38:27 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-21 11:38:27 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': '/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/', 'dtype': 'half', 'max_model_len': 4096, 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'block_size': 16, 'max_num_batched_tokens': 4096, 'max_num_seqs': 8}
INFO 07-21 11:38:31 [config.py:841] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 07-21 11:38:31 [config.py:3371] Casting torch.bfloat16 to torch.float16.
INFO 07-21 11:38:31 [config.py:1472] Using max model len 4096
WARNING 07-21 11:38:31 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. 
INFO 07-21 11:38:32 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":8,"local_cache_dir":null}, use_cached_outputs=False, 
2025-07-21 11:38:32,335 INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 192.168.13.140:6379...
2025-07-21 11:38:32,343 INFO worker.py:1917 -- Connected to Ray cluster.
INFO 07-21 11:38:32 [ray_utils.py:334] No current placement group found. Creating a new placement group.
INFO 07-21 11:38:32 [ray_distributed_executor.py:177] use_ray_spmd_worker: False
(pid=27248) INFO 07-21 11:38:35 [__init__.py:244] Automatically detected platform cuda.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:31] No plugins for group vllm.platform_plugins found.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:35] Checking if TPU platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:52] Checking if CUDA platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:72] Confirmed CUDA platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:100] Checking if ROCm platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:121] Checking if HPU platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:138] Checking if XPU platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:155] Checking if CPU platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:177] Checking if Neuron platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:52] Checking if CUDA platform is available.
(pid=63861, ip=192.168.13.142) DEBUG 07-21 11:38:35 [__init__.py:72] Confirmed CUDA platform is available.
INFO 07-21 11:38:35 [ray_distributed_executor.py:353] non_carry_over_env_vars from config: set()
INFO 07-21 11:38:35 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 07-21 11:38:35 [ray_distributed_executor.py:358] If certain env vars should NOT be copied to workers, add them to /home/deepseek-fb/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 07-21 11:38:35 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-21 11:38:35 [cuda.py:360] Using XFormers backend.
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:35 [__init__.py:39] Available plugins for group vllm.general_plugins:
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:35 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:35 [__init__.py:44] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(RayWorkerWrapper pid=27280, ip=192.168.13.141) INFO 07-21 11:38:36 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=27280, ip=192.168.13.141) INFO 07-21 11:38:36 [cuda.py:360] Using XFormers backend.
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:36 [config.py:4834] enabled custom ops: Counter()
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:36 [config.py:4836] disabled custom ops: Counter()
(RayWorkerWrapper pid=72519, ip=192.168.13.143) DEBUG 07-21 11:38:36 [parallel_state.py:919] world_size=4 rank=3 local_rank=0 distributed_init_method=tcp://192.168.13.140:56653 backend=nccl
(RayWorkerWrapper pid=27280, ip=192.168.13.141) [rank1]:[E721 11:38:37.582656759 ProcessGroupGloo.cpp:145] Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622] Traceback (most recent call last):
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 614, in execute_method
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     return func(*args, **kwargs)
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 606, in init_device
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     self.worker.init_device()  # type: ignore
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 193, in init_device
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     init_worker_distributed_environment(self.vllm_config, self.rank,
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 532, in init_worker_distributed_environment
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 968, in init_distributed_environment
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     _WORLD = init_world_group(ranks, local_rank, backend)
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 815, in init_world_group
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     return GroupCoordinator(
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 228, in __init__
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]                   ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 5052, in new_group
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     return _new_group_with_tag(
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 5142, in _new_group_with_tag
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1942, in _new_process_group_helper
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]     backend_class = ProcessGroupGloo(
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622]                     ^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=27280, ip=192.168.13.141) ERROR 07-21 11:38:37 [worker_base.py:622] RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

The error Gloo connectFullMesh failed is a typical network-communication problem between Ray/vLLM nodes in a distributed launch. It is not caused by the launch flags or the model, but by the cluster's network configuration.

Recommendations:

  1. Make sure all cluster nodes can ping each other by IP.
  2. Before starting Ray/vLLM, set the GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME environment variables to the name of your network interface (for example, eth0 or enp1s0f0). This must be done on every node before Ray is started!
    export GLOO_SOCKET_IFNAME=eth0
    export NCCL_SOCKET_IFNAME=eth0
    
  3. If you use Docker, pass these variables with -e when starting the container.
  4. Do not use pipeline-parallel-size > 1 on a 2080 Ti; for Qwen2.5-7B-Instruct, tensor-parallel-size=1 and pipeline-parallel-size=1 are enough (one card = one process).

Example command for a single 2080 Ti:

vllm serve /mnt/nfs/models/Qwen2.5-VL-7B-Instruct/ --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.9 --max-model-len 4096 --dtype=half

If you do want a distributed launch, make sure the environment variables are set on every node before Ray is started, and that the correct network interface is used; a sketch of the startup order follows below.
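
For reference, here is a minimal sketch of the startup order, assuming the interface is eth0 (substitute your own) and using <HEAD_NODE_IP> as a placeholder for your head node's address:

# On every node, in the shell of the same user that will run Ray:
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0

# On the head node:
ray start --head --port=6379

# On each worker node (<HEAD_NODE_IP> is a placeholder):
ray start --address=<HEAD_NODE_IP>:6379

# Then run vllm serve from the head node as usual.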

Would you like detailed instructions for configuring the environment variables and launching Ray/vLLM?

Sources:

(RayWorkerWrapper pid=28215, ip=192.168.13.141) WARNING 07-21 12:37:48 [config.py:1633] Possibly too large swap space. 16.00 GiB out of the 31.19 GiB total CPU memory is allocated for the swap space.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:00,  6.54it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:00<00:00,  5.11it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:00<00:00,  4.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  4.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  4.63it/s]

INFO 07-21 12:37:49 [default_loader.py:272] Loading weights took 1.11 seconds
INFO 07-21 12:37:50 [model_runner.py:1203] Model loading took 3.9972 GiB and 1.294743 seconds
(RayWorkerWrapper pid=28215, ip=192.168.13.141) INFO 07-21 12:37:51 [default_loader.py:272] Loading weights took 2.93 seconds
(pid=73437, ip=192.168.13.143) INFO 07-21 12:37:46 [__init__.py:244] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=28215, ip=192.168.13.141) INFO 07-21 12:37:51 [model_runner.py:1203] Model loading took 3.9972 GiB and 3.104790 seconds
(RayWorkerWrapper pid=28215, ip=192.168.13.141) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
(RayWorkerWrapper pid=28215, ip=192.168.13.141) You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
(RayWorkerWrapper pid=73437, ip=192.168.13.143) /tmp/tmpjxa3f9r3/main.c:5:10: fatal error: Python.h: No such file or directory
(RayWorkerWrapper pid=73437, ip=192.168.13.143)     5 | #include <Python.h>
(RayWorkerWrapper pid=73437, ip=192.168.13.143)       |          ^~~~~~~~~~
(RayWorkerWrapper pid=73437, ip=192.168.13.143) compilation terminated.
(RayWorkerWrapper pid=73437, ip=192.168.13.143) The original cause of the RayTaskError (<class 'torch._inductor.exc.InductorError'>) isn't serializable: cannot pickle 'frame' object. Overwriting the cause to a RayError.
(RayWorkerWrapper pid=73437, ip=192.168.13.143) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0. [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] Traceback (most recent call last):
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 614, in execute_method
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return func(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return func(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.model_runner.profile_run()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return func(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return func(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     hidden_or_intermediate_states = model_executable(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                                     ^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1139, in forward
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     inputs_embeds = self.get_input_embeddings_v0(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1071, in get_input_embeddings_v0
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     inputs_embeds = self.get_input_embeddings(input_ids)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1057, in get_input_embeddings
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     inputs_embeds = self.language_model.get_input_embeddings(input_ids)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 469, in get_input_embeddings
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self.model.get_input_embeddings(input_ids)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 335, in get_input_embeddings
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self.embed_tokens(input_ids)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 409, in forward
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     masked_input, input_mask = get_masked_input_and_mask(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     raise InductorError(e, currentframe()).with_traceback(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     mb_compiled_graph = fx_codegen_and_compile(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                         ^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     compiled_fn = graph.compile_to_module().call
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                   ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self._compile_to_module()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2091, in _compile_to_module
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                                                              ^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2002, in codegen
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.scheduler.codegen()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4135, in codegen
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     else self._codegen(self.nodes)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]          ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4264, in _codegen
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.get_backend(device).codegen_node(node)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 104, in codegen_node
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self._triton_scheduling.codegen_node(node)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1320, in codegen_node
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return self.codegen_node_schedule(
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1365, in codegen_node_schedule
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     src_code = kernel.codegen_kernel()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                ^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3623, in codegen_kernel
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     **self.inductor_meta_common(),
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3447, in inductor_meta_common
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     "backend_hash": torch.utils._triton.triton_hash_with_backend(),
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/utils/_triton.py", line 111, in triton_hash_with_backend
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     backend = triton_backend()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]               ^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/torch/utils/_triton.py", line 103, in triton_backend
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     target = driver.active.get_current_target()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self._initialize_obj()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self._obj = self._init_fn()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                 ^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     return actives[0]()
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]            ^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     self.utils = CudaUtils()  # TODO: make static
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]                  ^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/runtime/build.py", line 36, in _build
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]   File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622]     raise CalledProcessError(retcode, cmd)
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpjxa3f9r3/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpjxa3f9r3/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpjxa3f9r3', '-I/usr/include/python3.12']' returned non-zero exit status 1.
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] 
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(RayWorkerWrapper pid=73437, ip=192.168.13.143) ERROR 07-21 12:38:07 [worker_base.py:622] 
(RayWorkerWrapper pid=73437, ip=192.168.13.143) INFO 07-21 12:37:47 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) INFO 07-21 12:37:47 [cuda.py:360] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerWrapper pid=64729, ip=192.168.13.142) INFO 07-21 12:37:48 [__init__.py:1152] Found nccl from library libnccl.so.2 [repeated 2x across cluster]
(RayWorkerWrapper pid=64729, ip=192.168.13.142) INFO 07-21 12:37:48 [pynccl.py:70] vLLM is using nccl==2.26.2 [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) WARNING 07-21 12:37:48 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes. [repeated 2x across cluster]
(RayWorkerWrapper pid=28215, ip=192.168.13.141) INFO 07-21 12:37:48 [parallel_state.py:1076] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1 [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) INFO 07-21 12:37:48 [model_runner.py:1171] Starting to load model /mnt/nfs/models/Qwen2.5-VL-7B-Instruct/... [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) WARNING 07-21 12:37:48 [config.py:1633] Possibly too large swap space. 16.00 GiB out of the 31.19 GiB total CPU memory is allocated for the swap space. [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) INFO 07-21 12:37:51 [default_loader.py:272] Loading weights took 2.99 seconds [repeated 2x across cluster]
(RayWorkerWrapper pid=73437, ip=192.168.13.143) INFO 07-21 12:37:51 [model_runner.py:1203] Model loading took 3.9972 GiB and 3.164046 seconds [repeated 2x across cluster]

Main error:
(RayWorkerWrapper pid=73437, ip=192.168.13.143) /tmp/tmpjxa3f9r3/main.c:5:10: fatal error: Python.h: No such file or directory

torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', … '-I/usr/include/python3.12']' returned non-zero exit status 1.

Cause:
The Python development headers (python3-dev or python3.12-dev) are not installed on one of the nodes. Without them, Triton/Inductor cannot compile the CUDA kernels it uses for acceleration.

Solution:
Install the python3-dev package (or python3.12-dev if you are using Python 3.12) on every node of the cluster:

sudo apt-get update
sudo apt-get install python3-dev
# or for Python 3.12:
sudo apt-get install python3.12-dev

After that, restart the cluster and vLLM.
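
To verify the headers on a node before restarting, a quick check (a sketch; the include path matches the -I flag from the gcc command in your log):

python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])"
ls /usr/include/python3.12/Python.h   # should exist once python3.12-dev is installed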

Additionally:

  • The swap space warning is not critical, but you can reduce the swap with the --swap-space parameter (see the example after this list).
  • The slow image processor and preprocessor.json warnings do not block startup; to speed things up you can update transformers and re-save the processor.
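
For illustration, a hedged sketch of the flag in a serve command (2 GiB per GPU is only an example value; --swap-space is specified in GiB, and the other options should stay as in your own launch):

vllm serve /mnt/nfs/models/Qwen2.5-VL-7B-Instruct/ --tensor-parallel-size 4 --distributed-executor-backend ray --host 0.0.0.0 --port 8000 --max-model-len 4096 --swap-space 2 --dtype half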

Would you like detailed instructions for installing the dependencies on all nodes?

Sources:


I successfully deployed a Ray cluster and launched the Qwen2.5-7B-Instruct model.
I ran the command:
vllm serve /mnt/nfs/models/Qwen2.5-VL-7B-Instruct/ --tensor-parallel-size 4 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 8 --max-num-batched-tokens 4096 --block-size 16 --dtype half
It is important to note that the cluster must be assembled under the same user on every node, and you have to keep an eye on this.
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
and they must be set for that same user.
I have a question about the model’s answer
Why are its answers to a question like "what's your attitude?" nonsense?
When it is asked "You?", it answers:

I’m friends with him.
He’s a good person, just like me.
I love him as much as he does.
I love him as much as she does.
I love him as much as you do.
In the first sentence, “he” is a man.
In the second sentence, “she” is a woman.
In the third sentence, “you” is you (or someone else).

Maybe you should try a different model?

Your Ray cluster and vLLM launch are configured correctly. The cause of the "nonsensical" or confused answers from Qwen2.5-7B-Instruct is not a cluster or launch-parameter error but a property of the model itself and its training. The answers you describe ("I love him as much as she does", "I love him as much as you do", etc.) are a typical example of generation where the model does not understand the context, or the question is too abstract or ambiguous for it.

This is confirmed in other discussions: when a model answers nonsensically or off topic, it comes from its architecture, dataset quality, or insufficient fine-tuning on such questions, not from vLLM or the Ray cluster (see the discussion). It is recommended to try another model (for example, Llama-2-7B, Mistral-7B, Mixtral-8x7B) or to rephrase the question so it is more specific.

Would you like recommendations on choosing a model that gives more meaningful answers?

Sources:

Maybe try to retrain the model?
How does the training happen?
Maybe it is possible to use a prompt?

Retraining (fine-tuning) the model is possible, but vLLM is only for inference/serving and does not support training or fine-tuning. To train or fine-tune a model, use libraries like HuggingFace Transformers, PEFT, or LoRA. You can improve answers at inference time by carefully crafting your prompt (prompt engineering), but this does not change the model weights—only the input.

For better answers without retraining, try prompt engineering: rephrase or add context to your question. For actual retraining, see the HuggingFace fine-tuning guide. A small prompt example is shown below.
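
As a small prompt-engineering sketch against the running server (assuming the OpenAI-compatible API on port 8000, as in your serve command; the model field must match the name reported by /v1/models, and the system prompt and sampling values are only examples):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant. Answer briefly, directly, and stay on topic."},
          {"role": "user", "content": "What is your attitude towards people?"}
        ],
        "max_tokens": 128,
        "temperature": 0.2
      }'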

Would you like a step-by-step guide for prompt engineering or for fine-tuning with HuggingFace?

Sources: