When vLLM appears stuck after printing lines like "Using FlashAttention backend." and "vLLM is using nccl==2.18.1", it usually means the process is waiting for all tensor-parallel workers to initialize and synchronize, but something is blocking the distributed setup. This is a common issue with multi-GPU, tensor-parallel vLLM setups.
Checklist to resolve:
- Sufficient Shared Memory:
  Ensure your Docker container has enough shared memory (/dev/shm). For multi-GPU, this is critical. Start your container with --shm-size=32g (or higher, e.g. --shm-size=64g for large models). See vllm/issues/1437.
- No CUDA Initialization Before vLLM:
  You already restarted Docker, which is good. Make sure no other process or script in the container touches CUDA before vLLM starts.
- Correct NCCL and CUDA Environment:
  - Set NCCL_P2P_DISABLE=1 as an environment variable before starting vLLM. This can help if your GPUs are not fully NVLink-connected or if you see NCCL hangs.
  - Set CUDA_VISIBLE_DEVICES=0,1 (or whichever GPUs you want to use) before running vLLM.
  - If running in Docker, use --gpus all and the NVIDIA runtime.
- Ray and vLLM Version Compatibility:
  Some Ray/vLLM version mismatches can cause hangs. Use recent, matching versions.
- Wait for Model Loading:
  For large models, loading and initializing can take several minutes, especially on first run or with slow disks. Monitor nvidia-smi to see if GPU memory usage increases.
- Check for Deadlocks or Errors:
  If it hangs for more than 10-15 minutes, check Ray logs (/tmp/ray/session_latest/logs/) for errors, or run with debug logging:

      export VLLM_LOGGING_LEVEL=DEBUG
      export NCCL_DEBUG=TRACE
      python /root/vllm/vllm/entrypoints/openai/api_server.py ...

  See vLLM troubleshooting.
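To tell slow model loading apart from a real hang (the "Wait for Model Loading" item above), you can sample GPU memory a few times instead of watching nvidia-smi by hand. The following is only a rough sketch: the function names, the sample count, and the interval are made up for illustration, and it assumes nvidia-smi is on the PATH and reports plain integers for memory.used. Flat memory does not prove a hang, since usage also stops growing once the model is fully loaded, so treat the result as a hint.

```python
import subprocess
import time


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def read_memory_used() -> list[int]:
    """Ask nvidia-smi for the memory currently used on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_used(result.stdout)


def looks_stalled(samples: list[list[int]], min_growth_mib: int = 1) -> bool:
    """True if no GPU grew by at least min_growth_mib between first and last sample."""
    if len(samples) < 2:
        return False  # not enough data to judge
    first, last = samples[0], samples[-1]
    return all(after - before < min_growth_mib for before, after in zip(first, last))


def watch(num_samples: int = 6, interval_s: float = 10.0) -> bool:
    """Hypothetical helper: collect samples, print them, and report a stall."""
    samples = []
    for _ in range(num_samples):
        samples.append(read_memory_used())
        print(samples[-1])
        time.sleep(interval_s)
    return looks_stalled(samples)
```

Calling watch() in a second shell while the server starts prints one list of per-GPU values per sample. If it returns True while the server is still not accepting requests, that is a reasonable point to move on to the Ray logs and the debug logging settings.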
Example Docker run:
docker run --gpus all --shm-size=32g --ipc=host -it <your_image>
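Once inside a container started this way, it can help to confirm that the settings from the checklist actually took effect before launching vLLM. Here is a minimal pre-flight sketch using only the Python standard library. The function names and the 32 GiB threshold are assumptions for illustration (the threshold just mirrors the --shm-size=32g suggestion), and the NCCL_P2P_DISABLE warning only matters if peer-to-peer transfers turn out to be the cause of the hang.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_total_gib(path: str = "/dev/shm") -> float:
    """Total size of the shared-memory mount, in GiB."""
    return shutil.disk_usage(path).total / GIB


def preflight(env: dict, shm_gib: float, min_shm_gib: float = 32.0) -> list[str]:
    """Return human-readable warnings; an empty list means nothing looked off."""
    warnings = []
    if shm_gib < min_shm_gib:
        warnings.append(
            f"/dev/shm is {shm_gib:.1f} GiB, below the {min_shm_gib:.0f} GiB target"
        )
    if not env.get("CUDA_VISIBLE_DEVICES"):
        warnings.append("CUDA_VISIBLE_DEVICES is not set")
    if env.get("NCCL_P2P_DISABLE") != "1":
        warnings.append("NCCL_P2P_DISABLE is not set to 1")
    return warnings


def run_preflight() -> None:
    """Check the current process environment and print any warnings."""
    for warning in preflight(dict(os.environ), shm_total_gib()):
        print("WARN:", warning)
```

Running run_preflight() in the same shell that will start the server shows whether the exported variables are visible to child processes. The check logic takes the environment and the shared-memory size as arguments so it can be exercised without a GPU machine.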
Would you like more detailed step-by-step debugging instructions?