Help with an error when running vLLM with tensor parallel

When I tried to run vLLM with tensor parallel, it came up with the error below:

  ~/vllm   elora-dev !3 ?10                                                                      vllm root@01e648050431  07:44:25
❯ ray stop
Stopped all 5 Ray processes.

  ~/vllm   elora-dev !3 ?10                                                                 4s  vllm root@01e648050431  07:44:31
❯ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.17.0.3

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.17.0.3:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  ~/vllm   elora-dev !3 ?10                                                                 5s  vllm root@01e648050431  07:44:40
❯ ray status
======== Autoscaler status: 2025-05-21 07:44:43.159157 ========
Node status
---------------------------------------------------------------
Active:
 1 node_91ecea0e11ad34416378fdcad625e35f0a671ee556d872f4b208229d
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU
 0.0/4.0 GPU
 0B/77.66GiB memory
 0B/37.28GiB object_store_memory

Demands:
 (no resource demands)

  ~/vllm   elora-dev !3 ?10                                                                      vllm root@01e648050431  07:44:45
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
INFO 05-21 07:44:59 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 07:44:59 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 07:45:01,217	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 07:45:01,233	INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 07:45:03 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
(pid=133879) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(pid=133879)   warnings.warn(
(pid=133913) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(pid=133913)   warnings.warn(
2025-05-21 07:45:14,499	ERROR worker.py:422 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerVllm.init_worker() (pid=133913, ip=172.17.0.3, actor_id=58c2b835b03451e98f02452701000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fd818a8d670>)
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 180, in _check_capability
    capability = get_device_capability(d)
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 435, in get_device_capability
    prop = get_device_properties(device)
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 453, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=

The above exception was the direct cause of the following exception:

ray::RayWorkerVllm.init_worker() (pid=133913, ip=172.17.0.3, actor_id=58c2b835b03451e98f02452701000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fd818a8d670>)
  File "/root/vllm/vllm/engine/ray_utils.py", line 29, in init_worker
    self.worker = worker_init_fn()
  File "/root/vllm/vllm/executor/ray_gpu_executor.py", line 162, in <lambda>
    lambda rank=rank, local_rank=local_rank: Worker(
  File "/root/vllm/vllm/worker/worker.py", line 63, in __init__
    self.model_runner = ModelRunner(
  File "/root/vllm/vllm/worker/model_runner.py", line 90, in __init__
    self.attn_backend = get_attn_backend(
  File "/root/vllm/vllm/attention/selector.py", line 15, in get_attn_backend
    if _can_use_flash_attn(dtype):
  File "/root/vllm/vllm/attention/selector.py", line 32, in _can_use_flash_attn
    if torch.cuda.get_device_capability()[0] < 8:
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 435, in get_device_capability
    prop = get_device_properties(device)
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 317, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=

CUDA call was originally invoked at:

  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/vllm/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/vllm/vllm/engine/arg_utils.py", line 6, in <module>
    from vllm.config import (CacheConfig, DeviceConfig, LoRAConfig, ModelConfig,
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/vllm/vllm/config.py", line 7, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/__init__.py", line 1332, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in <module>
    _lazy_call(_check_capability)
  File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 241, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
INFO 05-21 07:45:14 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]   File "/root/vllm/vllm/engine/ray_utils.py", line 36, in execute_method
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]     executor = getattr(self, method)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]   File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]     return method(self, *_args, **_kwargs)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]   File "/root/vllm/vllm/engine/ray_utils.py", line 32, in __getattr__
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44]     return getattr(self.worker, name)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] AttributeError: 'NoneType' object has no attribute 'init_device'


However, I find that there are 4 GPUs:

❯ python
Python 3.9.21 | packaged by conda-forge | (main, Dec  5 2024, 13:51:40)
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
>>> print(f"PyTorch version: {torch.__version__}")
PyTorch version: 2.1.2+cu121
>>> print(f"CUDA available: {torch.cuda.is_available()}")
CUDA available: True
>>> if torch.cuda.is_available():
...     print(f"CUDA version: {torch.version.cuda}")
...     print(f"Number of GPUs: {torch.cuda.device_count()}")
...     for i in range(torch.cuda.device_count()):
...         print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
...         print(f"  Capability: {torch.cuda.get_device_capability(i)}")
... else:
...     print("CUDA not available. Please check your installation.")
...
CUDA version: 12.1
Number of GPUs: 4
GPU 0: NVIDIA A40
  Capability: (8, 6)
GPU 1: NVIDIA A40
  Capability: (8, 6)
GPU 2: NVIDIA A40
  Capability: (8, 6)
GPU 3: NVIDIA A40
  Capability: (8, 6)

The same shows up with nvidia-smi:

 nvidia-smi

Wed May 21 07:42:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:56:00.0 Off |                    0 |
|  0%   43C    P0            101W /  300W |   12271MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     Off |   00000000:57:00.0 Off |                    0 |
|  0%   30C    P8             22W /  300W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A40                     Off |   00000000:D1:00.0 Off |                    0 |
|  0%   31C    P8             23W /  300W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A40                     Off |   00000000:D6:00.0 Off |                    0 |
|  0%   33C    P8             32W /  300W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I would appreciate any advice.

The error RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50 when using tensor parallelism in vLLM is typically caused by a mismatch between the number of GPUs detected by PyTorch and the number of GPUs vLLM expects to use. This can happen if CUDA is initialized before vLLM starts, or if the environment variables (like CUDA_VISIBLE_DEVICES) are not set correctly before launching vLLM. Even if torch.cuda.device_count() and nvidia-smi show 4 GPUs, vLLM requires that CUDA is not initialized in your Python process before vLLM starts, and that the tensor parallel size divides the number of attention heads in the model evenly.

Key points and solutions:

  • Do not call any CUDA-related functions (e.g., torch.cuda.is_available(), torch.cuda.device_count()) before initializing vLLM. This can cause CUDA to be initialized in the parent process, which breaks multiprocessing/forking used by vLLM for tensor parallelism. See vLLM offline inference docs and related issues.
  • Make sure your tensor_parallel_size is a divisor of the model’s number of attention heads. Llama-2-7b has 32 attention heads, so values like 2, 4, 8, and 16 are valid. See this issue.
  • Set CUDA_VISIBLE_DEVICES before starting vLLM to control which GPUs are visible, rather than relying on PyTorch’s device selection.
  • If you are running in a Docker container, ensure the container is started with --gpus all and that the NVIDIA runtime is properly configured.

Example:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2

Do not run any Python code that touches CUDA before this command.
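
As a quick pre-flight check, here is a small sketch (run it in a separate process, so the actual vLLM launch still starts with CUDA uninitialized; the model name and parallel size mirror the command above, and fetching the config assumes you already have access to the gated repo):

import torch
from transformers import AutoConfig

MODEL = "meta-llama/Llama-2-7b-hf"
TENSOR_PARALLEL_SIZE = 2  # must evenly divide the model's attention heads

cfg = AutoConfig.from_pretrained(MODEL)
heads = cfg.num_attention_heads
assert heads % TENSOR_PARALLEL_SIZE == 0, (
    f"tensor_parallel_size={TENSOR_PARALLEL_SIZE} does not divide {heads} attention heads"
)

print(f"attention heads: {heads} -> tensor_parallel_size={TENSOR_PARALLEL_SIZE} is valid")
print(f"GPUs visible to PyTorch: {torch.cuda.device_count()}")
print(f"CUDA initialized in this process: {torch.cuda.is_initialized()}")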

Would you like a step-by-step troubleshooting checklist?

I tried restarting my Docker container (so that no Python code touches CUDA before vLLM starts).
That avoids the earlier bug, but now it gets stuck at the point below:

Last login: Wed May 21 16:17:10 on ttys002
░▒▓  │  ~ ······························································································· base  │ 16:21:01  ▓▒░─╮
❯ ssh epcc-gpu-A40-2                                                                                                                 ─╯
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-88-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

 * Introducing Expanded Security Maintenance for Applications.
   Receive updates to over 25,000 software packages with your
   Ubuntu Pro subscription. Free for personal use.

     https://ubuntu.com/pro

Expanded Security Maintenance for Infrastructure is not enabled.

401 updates can be applied immediately.
318 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable

Enable ESM Infra to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status

New release '22.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Your Hardware Enablement Stack (HWE) is supported until April 2025.
Web console: https://a40-02:9090/ or https://10.2.64.74:9090/

Last login: Wed May 21 16:17:13 2025 from 10.2.0.1
 ~ ................................................................................................. base py  yixwang@a40-02  16:21:04
> docker restart yixwang-elora
yixwang-elora

 ~ ............................................................................................ 12s  base py  yixwang@a40-02  16:21:20
> docker exec -it yixwang-elora zsh
  /workspace                                                                                       base root@01e648050431  08:21:23
❯ cd ~

  ~                                                                                                base root@01e648050431  08:21:26
❯ conda activate vllm

  ~                                                                                                vllm root@01e648050431  08:21:30
❯ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.17.0.3

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.17.0.3:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  ~                                                                                           3s  vllm root@01e648050431  08:21:41
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
INFO 05-21 08:21:54 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 08:21:54 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 08:21:55,353	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 08:21:55,370	INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 08:21:57 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 05-21 08:22:04 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=2475) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(RayWorkerVllm pid=2475)   warnings.warn(
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 selector.py:16] Using FlashAttention backend.
INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1

When vLLM gets stuck after printing lines like Using FlashAttention backend. and vLLM is using nccl==2.18.1, it usually means the process is waiting for all tensor parallel workers to initialize and synchronize, but something is blocking the distributed setup. This is a common issue with multi-GPU/tensor-parallel vLLM setups.

Checklist to resolve:

  1. Sufficient Shared Memory:
    Ensure your Docker container has enough shared memory (/dev/shm). For multi-GPU setups this is critical. Start your container with --shm-size=32g (or higher, e.g. --shm-size=64g for large models).
    See vllm/issues/1437.

  2. No CUDA Initialization Before vLLM:
    You already restarted Docker, which is good. Make sure no other process or script in the container touches CUDA before vLLM starts.

  3. Correct NCCL and CUDA Environment:

    • Set NCCL_P2P_DISABLE=1 as an environment variable before starting vLLM. This can help if your GPUs are not fully NVLink-connected or if you see NCCL hangs.
    • Set CUDA_VISIBLE_DEVICES=0,1 (or whichever GPUs you want to use) before running vLLM.
    • If running in Docker, use --gpus all and the NVIDIA runtime.
  4. Ray and vLLM Version Compatibility:
    Some Ray/vLLM version mismatches can cause hangs. Use recent, matching versions.

  5. Wait for Model Loading:
    For large models, loading and initializing can take several minutes, especially on first run or with slow disks. Monitor nvidia-smi to see if GPU memory usage increases.

  6. Check for Deadlocks or Errors:
    If it hangs for more than 10-15 minutes, check Ray logs (/tmp/ray/session_latest/logs/) for errors, or run with debug logging:

    export VLLM_LOGGING_LEVEL=DEBUG
    export NCCL_DEBUG=TRACE
    python /root/vllm/vllm/entrypoints/openai/api_server.py ...
    

    See vLLM troubleshooting.
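
To act on item 6 above, here is a small sketch that scans the Ray session logs for errors and tracebacks (the log directory comes from the checklist; the .err/.log glob patterns are an assumption about how your Ray session names its files):

import pathlib

log_dir = pathlib.Path("/tmp/ray/session_latest/logs")
for path in sorted(log_dir.glob("*.err")) + sorted(log_dir.glob("*.log")):
    # Scan each worker/raylet log for obvious failure markers.
    for line in path.read_text(errors="replace").splitlines():
        if "ERROR" in line or "Traceback" in line:
            print(f"{path.name}: {line}")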

Example Docker run:

docker run --gpus all --shm-size=32g --ipc=host -it <your_image>
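
Once inside the container, this small sketch checks the shared memory actually available to NCCL (the /dev/shm path is standard; the 16 GiB threshold is an assumption based on the advice above):

import shutil

# Equivalent of `df -h /dev/shm` from Python.
shm = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {shm.total / 2**30:.1f} GiB, free: {shm.free / 2**30:.1f} GiB")
if shm.total < 16 * 2**30:
    print("Warning: /dev/shm is below 16 GiB; restart the container with a larger --shm-size.")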

Would you like more detailed step-by-step debugging instructions?

I found it stuck here:

  ~                                                                                           4s  vllm root@01e648050431  08:40:10
❯ export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=TRACE


  ~                                                                                                vllm root@01e648050431  08:40:12
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
INFO 05-21 08:40:22 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 08:40:22 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 08:40:23,486	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 08:40:23,499	INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 08:40:25 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 05-21 08:40:32 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=2522) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(RayWorkerVllm pid=2522)   warnings.warn(
(RayWorkerVllm pid=2522) INFO 05-21 08:40:33 selector.py:16] Using FlashAttention backend.
INFO 05-21 08:40:34 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=2522) INFO 05-21 08:40:34 pynccl_utils.py:45] vLLM is using nccl==2.18.1
01e648050431:2355:2355 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
01e648050431:2355:2355 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.18.1+cuda12.1
01e648050431:2355:5554 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
01e648050431:2355:5554 [0] NCCL INFO P2P plugin IBext
01e648050431:2355:5554 [0] NCCL INFO NET/IB : No device found.
01e648050431:2355:5554 [0] NCCL INFO NET/IB : No device found.
01e648050431:2355:5554 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
01e648050431:2355:5554 [0] NCCL INFO Using network Socket
01e648050431:2355:5554 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
01e648050431:2355:5554 [0] NCCL INFO Channel 00/04 :    0   1
01e648050431:2355:5554 [0] NCCL INFO Channel 01/04 :    0   1
01e648050431:2355:5554 [0] NCCL INFO Channel 02/04 :    0   1
01e648050431:2355:5554 [0] NCCL INFO Channel 03/04 :    0   1
01e648050431:2355:5554 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
01e648050431:2355:5554 [0] NCCL INFO P2P Chunksize set to 131072
01e648050431:2355:5554 [0] NCCL INFO Channel 00/0 : 0[56000] -> 1[57000] via P2P/IPC
01e648050431:2355:5554 [0] NCCL INFO Channel 01/0 : 0[56000] -> 1[57000] via P2P/IPC
01e648050431:2355:5554 [0] NCCL INFO Channel 02/0 : 0[56000] -> 1[57000] via P2P/IPC
01e648050431:2355:5554 [0] NCCL INFO Channel 03/0 : 0[56000] -> 1[57000] via P2P/IPC
01e648050431:2355:5554 [0] NCCL INFO Connected all rings
01e648050431:2355:5554 [0] NCCL INFO Connected all trees
01e648050431:2355:5554 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
01e648050431:2355:5554 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
01e648050431:2355:5554 [0] NCCL INFO comm 0x55c04da80680 rank 0 nranks 2 cudaDev 0 busId 56000 commId 0xe50ea1d541ba405d - Init COMPLETE
01e648050431:2355:2355 [0] NCCL INFO Using network Socket

Your logs show that NCCL initialized successfully (the two ranks connected via P2P/IPC channels, with plain TCP sockets as the network transport), but vLLM appears to hang after this point. This is a common symptom when running vLLM with tensor parallelism in Docker, and is usually caused by one of the following:

  1. Insufficient shared memory: vLLM and NCCL require a large /dev/shm for inter-process communication. If your Docker container was not started with a large enough --shm-size (e.g., at least --shm-size=16g), the process can hang at NCCL initialization.
    See vllm/issues/1437 and vLLM Docker docs.

  2. CUDA IPC issues: If your container is missing the --ipc=host flag, CUDA IPC (used for P2P/IPC) may not work, causing hangs.
    See vllm/issues/3710.

  3. Model loading time: For large models, initialization can take several minutes, especially on first run or with slow disks. Monitor nvidia-smi to see if GPU memory usage increases.

Recommended solution:
Restart your container with both --shm-size=16g (or higher) and --ipc=host:

docker run --gpus all --shm-size=16g --ipc=host -it <your_image>

Then rerun your vLLM command.
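
If it still hangs, a small sketch can check whether CUDA peer-to-peer access actually works between the GPUs that the NCCL log paired via P2P/IPC; if peer access is unavailable or broken inside the container, setting NCCL_P2P_DISABLE=1 is a reasonable workaround (checking every visible GPU pair is an assumption, trim to the devices you expose via CUDA_VISIBLE_DEVICES):

import itertools

import torch

num = torch.cuda.device_count()
for a, b in itertools.permutations(range(num), 2):
    # True means the driver allows direct P2P copies between the two devices.
    ok = torch.cuda.can_device_access_peer(a, b)
    print(f"GPU {a} -> GPU {b}: peer access {'available' if ok else 'NOT available'}")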

Would you like more detailed troubleshooting steps?

I found that my GPUs are working:

Wed May 21 08:48:29 2025                        (Press h for help or q to quit)
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.4.2      Driver Version: 570.124.06      CUDA Driver Version: 12.8 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU Fan Temp Perf Pwr:Usg/Cap │         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪═══════════════════════════╤════════════════════════════╕
│   0  0%  43C  P0  105W / 300W │  12990MiB / 44.99GiB │    100%      Default │ MEM: ████▌ 28.2%          │ UTL: █████████████████ MAX │
├───────────────────────────────┼──────────────────────┼──────────────────────┼───────────────────────────┼────────────────────────────┤
│   1  0%  38C  P0   83W / 300W │  987.4MiB / 44.99GiB │    100%      Default │ MEM: ▍ 2.1%               │ UTL: █████████████████ MAX │
├───────────────────────────────┼──────────────────────┼──────────────────────┼───────────────────────────┼────────────────────────────┤
│   2  0%  30C  P8   23W / 300W │  572.1MiB / 44.99GiB │      0%      Default │ MEM: ▎ 1.2%               │ UTL: ▏ 0%                  │
├───────────────────────────────┼──────────────────────┼──────────────────────┼───────────────────────────┼────────────────────────────┤
│   3  0%  33C  P8   33W / 300W │  572.1MiB / 44.99GiB │      0%      Default │ MEM: ▎ 1.2%               │ UTL: ▏ 0%                  │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧═══════════════════════════╧════════════════════════════╛
[ CPU: ████████████████████████████████████▍ 42.4%                              UPTIME: 54.8 days ]  ( Load Average: 19.88 20.43 20.54 )
[ MEM: █████████████████████████ 29.0%                                             USED: 34.88GiB ]  [ SWP: ████████████████████▌ 93%  ]
 !CAUTION: SUPERUSER LOGGED-IN.                                                          (Press ^C(INT)/T(TERM)/K(KILL) to send signals)
╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                         root@01e648050431 │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM  TIME  COMMAND                                                                        │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2355 C    root 410.0MiB   0  99.9   0.8  8:13  python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/L.. │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1    2522 C    root 410.0MiB 100 100.4   0.7  8:05  ray::RayWorkerVllm.execute_method                                              │
╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

but it is still stuck.

Your GPU utilization and memory usage show that the vLLM processes are running, but the model is not fully loaded or serving requests. This is a classic symptom of a distributed initialization hang in multi-GPU Docker setups. The most common causes and solutions, based on your logs and GPU state, are:


1. Docker Shared Memory and IPC Settings

  • You must start your container with both --shm-size=16g (or higher) and --ipc=host.
    Without these, CUDA IPC and NCCL can hang indefinitely at initialization, even if the GPUs are visible and partially used.
  • Example:
    docker run --gpus all --shm-size=16g --ipc=host -it <your_image>
    

2. NCCL Network Backend Fallback

  • Your logs show NCCL INFO Using network Socket and NET/IB : No device found.
    This means NCCL found no InfiniBand device and falls back to TCP sockets for its network transport. On a single node the two GPUs still communicate via P2P/IPC (see the Channel ... via P2P/IPC lines), so this alone should not hang; however, if there are firewall or Docker network issues, it can still deadlock.
  • If you are not using InfiniBand, this is expected, but you must ensure all Ray/vLLM processes can communicate over the Docker network.

3. Ray and vLLM Multiprocessing

  • If you see both python ...api_server.py and ray::RayWorkerVllm.execute_method at 100% CPU, but no progress, it’s likely a deadlock in inter-process communication.
  • Try setting the environment variable NCCL_P2P_DISABLE=1 before starting vLLM to force NCCL to avoid problematic peer-to-peer paths:
    export NCCL_P2P_DISABLE=1
    

4. Model Loading Time

  • For large models, initial loading can take several minutes, but you should see GPU memory usage increase steadily. If it stays flat for >10 minutes, it’s likely a hang, not just slow loading.

5. Debugging Further

  • Check /dev/shm inside the container: it should be large (e.g., df -h /dev/shm).
  • Check Ray logs in /tmp/ray/session_latest/logs/ for errors or deadlocks.
  • Try running with only one GPU (--tensor-parallel-size 1) to confirm the model loads and serves correctly.

Summary of what to try next:

  1. Stop all containers and Ray processes.
  2. Start your Docker container with --shm-size=16g --ipc=host.
  3. Set NCCL_P2P_DISABLE=1 before launching vLLM.
  4. Run vLLM as before.

If it still hangs, please check the Ray logs and let me know if you see any errors.
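
A standalone two-GPU NCCL all-reduce test, independent of vLLM and Ray, can also narrow this down: if the sketch below hangs as well, the problem is the container/NCCL setup rather than vLLM itself (the master address and port are arbitrary assumptions; any free local port works):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Rendezvous over localhost; NCCL then sets up the GPU-to-GPU channels.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # a broken NCCL setup typically hangs right here
    print(f"rank {rank}: all_reduce result = {t.item()} (expected {float(world_size)})")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)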

Would you like a step-by-step Docker command and environment setup?
