RTX Pro 6000 Tensor Parallelism CUBLAS_STATUS_ALLOC_FAILED

I am trying to run meta-llama/Llama-2-70b-chat-hf across 4 RTX Pro 6000 GPUs with TP 4, but I keep running into CUBLAS_STATUS_ALLOC_FAILED errors. I have tried smaller models, smaller batch sizes and max sequence lengths, and different parallelism layouts (TP 2 with PP 2); nothing works with tensor parallelism. However, the same model works fine with pipeline parallelism (just with extremely slow container load times, but that's a separate issue).
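For reference, this is roughly equivalent to what I am launching inside the container, expressed with the offline Python API for brevity (the prompt and sampling values here are placeholders, not my exact settings):

```python
from vllm import LLM, SamplingParams

# Llama-2-70B sharded across the 4 RTX Pro 6000s with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,   # this layout fails with CUBLAS_STATUS_ALLOC_FAILED
    dtype="bfloat16",
    max_model_len=4096,       # I also tried smaller values
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```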

I am on CUDA 12.9 with driver version 575.57.08, using the latest Docker container.
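A quick sanity check of what PyTorch sees inside the container (a throwaway snippet, nothing vLLM-specific):

```python
import torch

# Confirm the CUDA runtime and the GPUs visible inside the container.
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```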

Execution log (full log on Pastebin):

```
INFO 09-13 09:49:49 [__init__.py:241] Automatically detected platform cuda.
```

Tensor parallelism (TP) in vLLM requires that the model's attention head count be divisible by tensor_parallel_size. If that condition is not met (e.g., TP=3, since Llama-2-70b has 64 attention heads), you will hit allocation or initialization errors; note that 2, 4, and 8 all divide 64, so a TP=4 run does satisfy this requirement. Additionally, some users have reported that certain GPU models (like the RTX Pro 6000) and CUDA versions have compatibility issues with TP, leading to cuBLAS allocation errors. Pipeline parallelism (PP) does not have this divisibility requirement and is often more robust in such constrained environments, though it can be slower to initialize and run due to increased inter-GPU communication and layer-partitioning overhead.
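As a quick way to verify the divisibility condition for the sizes you are testing, you can read the head counts straight from the Hugging Face config (a minimal sketch; it assumes your environment has access to the gated Llama-2 repo):

```python
from transformers import AutoConfig

# Llama-2-70b ships with 64 attention heads and 8 KV heads (GQA).
cfg = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

for tp in (2, 3, 4, 8):
    divisible = cfg.num_attention_heads % tp == 0
    print(f"TP={tp}: {cfg.num_attention_heads} heads, "
          f"{cfg.num_key_value_heads} KV heads, divisible={divisible}")
```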

If you must use TP, ensure that tensor_parallel_size divides the number of attention heads (for Llama-2-70b's 64 heads, 2, 4, or 8 are all valid choices on a single node). If your GPUs have limited VRAM, also check that the model still fits after sharding. If TP still fails, try updating your NCCL and CUDA libraries, or use PP as a fallback, as you have done. For more details, see the discussion in vLLM Issue #11147 and vLLM Issue #5003. Would you like a step-by-step troubleshooting guide or more details on optimizing PP performance?
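If you do stay on the PP fallback, the same engine arguments can also express a more conservative memory footprint. A rough sketch of that layout (the values are illustrative, not tuned recommendations, and depending on your vLLM version pipeline parallelism may only be available through the API server rather than the offline LLM class):

```python
from vllm import LLM

# Pure pipeline-parallel fallback across the 4 GPUs, with a smaller context window
# and a lower GPU memory target to leave headroom for activations and cuBLAS workspaces.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=1,
    pipeline_parallel_size=4,
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)
```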

The head count is divisible by the TP sizes being tested.

If the attention head count is divisible by your tensor_parallel_size and you still encounter cuBLAS allocation errors, the issue is more likely GPU memory exhaustion or a compatibility problem between vLLM, CUDA 12.9, and your RTX Pro 6000 GPUs. This is a known problem with some Blackwell/Ada Lovelace GPUs and recent CUDA/NCCL versions, where TP triggers cuBLAS or allocation errors even when the configuration is correct. Users have reported that upgrading to the latest NCCL (e.g., nvidia-nccl-cu12==2.27.3) resolved these allocation issues on similar hardware and CUDA versions, as seen in vLLM Issue #21239.
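A quick way to confirm which NCCL build is actually in use before and after an upgrade (a small sketch; torch.cuda.nccl.version() reports the NCCL that PyTorch loads, and the nvidia-nccl-cu12 wheel is only present in pip-based installs):

```python
import importlib.metadata

import torch

# NCCL version as seen by PyTorch (and therefore by vLLM's distributed backend).
print("torch-visible NCCL:", torch.cuda.nccl.version())

# Version of the pip-installed NCCL wheel, if one is present.
try:
    print("nvidia-nccl-cu12:", importlib.metadata.version("nvidia-nccl-cu12"))
except importlib.metadata.PackageNotFoundError:
    print("nvidia-nccl-cu12 wheel not installed")
```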

If you have not already, try updating your NCCL library to the latest version. If the problem persists, it may be a compatibility issue that is not yet fully resolved in vLLM for your hardware/CUDA combination. Would you like more detailed troubleshooting steps or guidance on alternative parallelism strategies?
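If the error persists after the NCCL upgrade, more verbose logs usually make it clear whether the failure happens during NCCL communicator setup or in a later cuBLAS workspace allocation. A sketch of the standard debug switches, set before the engine is created:

```python
import os

# Enable verbose logging before creating the engine, then re-run the failing TP=4 launch.
os.environ["NCCL_DEBUG"] = "INFO"           # NCCL communicator setup details
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # more detailed vLLM engine logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # surface the failing CUDA call synchronously

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
```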
