RTX Pro 6000 Tensor Parallelism CUBLAS_STATUS_ALLOC_FAILED

If the model's attention head count is divisible by your tensor_parallel_size and you still hit cuBLAS allocation errors such as CUBLAS_STATUS_ALLOC_FAILED, the likely causes are GPU memory exhaustion or an incompatibility between vLLM, CUDA 12.9, and your RTX Pro 6000 GPUs. Similar failures have been reported on some Blackwell/Lovelace GPUs with recent CUDA and NCCL versions, where tensor parallelism (TP) triggers cuBLAS or allocation errors even when the configuration is correct. Users have reported that upgrading NCCL (e.g., nvidia-nccl-cu12==2.27.3) can resolve these allocation errors on similar hardware and CUDA versions, as seen in vLLM Issue #21239.
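Before changing anything, it can help to confirm the two preconditions mentioned above: that the head count really divides evenly across the TP ranks, and which NCCL wheel is installed. The following is a minimal sketch, not part of vLLM. The helper names and the example values NUM_HEADS and TP_SIZE are hypothetical, and 2.27.3 is simply the version users reported as working, not a verified minimum. It also assumes NCCL was installed through pip as nvidia-nccl-cu12; a system-wide NCCL install will not show up here.

```python
import re
from importlib import metadata


def heads_divisible(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    """True when every TP rank receives a whole number of attention heads."""
    return tensor_parallel_size > 0 and num_attention_heads % tensor_parallel_size == 0


def parse_version(text: str) -> tuple:
    """Turn '2.27.3' into (2, 27, 3) so versions compare numerically."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def installed_version(package: str = "nvidia-nccl-cu12"):
    """Return the pip-installed version string, or None if pip has no record of it."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def older_than(installed: str, reference: str = "2.27.3") -> bool:
    """True when the installed version sorts before the reference version."""
    return parse_version(installed) < parse_version(reference)


if __name__ == "__main__":
    NUM_HEADS = 32  # hypothetical: read this from your model's config
    TP_SIZE = 4     # hypothetical: the tensor_parallel_size you pass to vLLM

    print("heads divisible by TP size:", heads_divisible(NUM_HEADS, TP_SIZE))

    nccl = installed_version()
    if nccl is None:
        print("nvidia-nccl-cu12 not found via pip; NCCL may be installed another way")
    elif older_than(nccl):
        print(f"NCCL {nccl} is older than 2.27.3; consider upgrading")
    else:
        print(f"NCCL {nccl} is at or above the version reported to work")
```

If the second check reports an older version, the upgrade the reports describe would be `pip install -U nvidia-nccl-cu12` inside the same environment that runs vLLM.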

If you have not already done so, try updating your NCCL library to the latest version. If the problem persists, it may be a compatibility issue that vLLM has not yet fully resolved for your hardware and CUDA combination. Would you like more detailed troubleshooting steps or guidance on alternative parallelism strategies?
