Help with an error when running vLLM with tensor parallelism

When vLLM appears stuck after printing lines like "Using FlashAttention backend." and "vLLM is using nccl==2.18.1", it usually means the process is waiting for all tensor-parallel workers to initialize and synchronize, but something is blocking the distributed setup. This is a common issue with multi-GPU, tensor-parallel vLLM setups.

Checklist to resolve:

  1. Sufficient Shared Memory:
    Ensure your Docker container has enough shared memory (/dev/shm). For multi-GPU runs this is critical, because the worker processes share data through it. Start your container with --shm-size=32g (or higher, e.g. --shm-size=64g for large models).
    See vllm/issues/1437.

  2. No CUDA Initialization Before vLLM:
    You already restarted Docker, which is good. Make sure no other process or script in the container touches CUDA before vLLM starts.

  3. Correct NCCL and CUDA Environment:

    • Set NCCL_P2P_DISABLE=1 as an environment variable before starting vLLM. This can help if your GPUs are not fully NVLink-connected or if you see NCCL hangs.
    • Set CUDA_VISIBLE_DEVICES=0,1 (or whichever GPUs you want to use) before running vLLM.
    • If running in Docker, use --gpus all and the NVIDIA runtime.

  4. Ray and vLLM Version Compatibility:
    Some Ray/vLLM version mismatches can cause hangs. Use recent, matching versions.

  5. Wait for Model Loading:
    For large models, loading and initializing can take several minutes, especially on the first run or with slow disks. Monitor nvidia-smi to see whether GPU memory usage is still increasing; if it is, loading is probably still in progress rather than hung.

  6. Check for Deadlocks or Errors:
    If it hangs for more than 10-15 minutes, check Ray logs (/tmp/ray/session_latest/logs/) for errors, or run with debug logging:

    export VLLM_LOGGING_LEVEL=DEBUG
    export NCCL_DEBUG=TRACE
    python /root/vllm/vllm/entrypoints/openai/api_server.py ...
    

    See vLLM troubleshooting.
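
The first few checks above can be scripted and run inside the container before launching vLLM. The following is a minimal sketch, not a vLLM feature: the helper names shm_avail_gib and preflight are hypothetical, and the 32 GiB default only mirrors the --shm-size=32g suggestion above. It only reads state and does not start vLLM.

```shell
# Preflight sketch. Helper names and the default threshold are illustrative
# assumptions, not part of vLLM or NCCL.

# Print the space available on a mount point in whole GiB (default: /dev/shm).
shm_avail_gib() {
    mount_point="${1:-/dev/shm}"
    # -P keeps each filesystem on one line; -k reports 1024-byte blocks.
    # Column 4 of the data row is the available space.
    df -Pk "$mount_point" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print the settings from the checklist. Returns non-zero if the mount point
# has less than the requested number of GiB available.
preflight() {
    min_gib="${1:-32}"
    mount_point="${2:-/dev/shm}"
    avail="$(shm_avail_gib "$mount_point")"
    avail="${avail:-0}"
    echo "available at ${mount_point}: ${avail} GiB (want >= ${min_gib})"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
    echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE:-<unset>}"
    # List the GPUs the driver can see, if nvidia-smi is installed.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    fi
    [ "$avail" -ge "$min_gib" ]
}
```

Running preflight with no arguments checks /dev/shm against 32 GiB; preflight 64 raises the threshold for larger models. A non-zero exit status suggests restarting the container with a larger --shm-size before looking at NCCL.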

Example Docker run:

docker run --gpus all --shm-size=32g --ipc=host -it <your_image>
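
The environment variables from the checklist only take effect if they are set inside the container. One way to do that is Docker's -e flag. The sketch below assumes two GPUs (indices 0 and 1) and keeps <your_image> as a placeholder; vllm_docker_cmd is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Print a docker run command that carries the checklist settings into the
# container. vllm_docker_cmd is a hypothetical helper, not a Docker or vLLM tool.
vllm_docker_cmd() {
    image="${1:-<your_image>}"   # placeholder until a real image name is passed
    shm="${2:-32g}"              # shared memory size, as suggested in item 1
    printf '%s ' docker run --gpus all "--shm-size=${shm}" --ipc=host \
        -e NCCL_P2P_DISABLE=1 \
        -e NCCL_DEBUG=TRACE \
        -e VLLM_LOGGING_LEVEL=DEBUG \
        -e CUDA_VISIBLE_DEVICES=0,1 \
        -it "$image"
    printf '\n'
}

# Show the command with the defaults.
vllm_docker_cmd
```

Pass the image name and, optionally, a shared memory size as arguments, for example vllm_docker_cmd <your_image> 64g. Once the hang is resolved, the two debug variables can be dropped to reduce log volume.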

Would you like more detailed step-by-step debugging instructions?
