vLLM on RTX 5090: Working GPU setup with torch 2.9.0 cu128

I ran into the exact same issue on WSL (Ubuntu 24.04), and it turned out to be a binary compatibility problem between PyTorch, xFormers, FlashInfer, and the vLLM CUDA extension.

What worked for me was setting up the dependencies explicitly (a pinned xFormers version and FlashInfer built from source) and then installing vLLM from source instead of using prebuilt wheels.

Steps that fixed it for me:

First, install a compatible xFormers version:

pip install --no-cache-dir "xformers==0.0.33.post1"
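
While resolving xFormers, pip can replace the torch that is already installed, so it is worth confirming the CUDA 12.8 build is still in place before building anything else. A minimal sketch, assuming torch reports its build as a +cu128 suffix on torch.__version__ (which is how the official wheels do it, as far as I know). The helper name is_cu128_build is made up for illustration and is not part of any tool:

```shell
# Hypothetical helper: succeeds only if the given torch version string
# ends in the +cu128 local suffix (assumption: official wheels report it this way).
is_cu128_build() {
  case "$1" in
    *+cu128) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (needs torch installed, so it is left commented out here):
# is_cu128_build "$(python -c 'import torch; print(torch.__version__)')" \
#   || echo "torch is not the cu128 build; reinstall it before continuing" >&2
```

If the check fails, reinstall torch 2.9.0 cu128 first, since everything built afterwards is compiled against whatever torch is present.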

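Then build FlashInfer from source at a pinned commit (resync the submodules after the checkout so they match that commit):

git clone https://github.com/flashinfer-ai/flashinfer.git --branch main --recursive ./flashinfer
cd ./flashinfer
git checkout cd928a7e044c94bdd96e3f7ca79a0514b253ea6d
git submodule update --init --recursive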

pip install --no-cache-dir ninja build packaging "setuptools>=75.6.0"

python -m flashinfer.aot
python -m build --no-isolation --wheel
pip install dist/flashinfer*.whl

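After that, I installed vLLM from source (not the pip wheels), following the official source install steps, so that its CUDA extension gets compiled against the same torch 2.9.0 cu128 build as everything above.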
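
For reference, here is a rough sketch of a source install that keeps the already-installed torch. These are not the exact commands used above. The repo URL, the use_existing_torch.py script, and requirements/build.txt are taken from the vLLM repository and docs as I remember them, so check the current docs before relying on them. The run and install_vllm_from_source names are hypothetical helpers for illustration only:

```shell
# Sketch of a vLLM source build against the torch that is already installed.
# Assumptions: repo URL, use_existing_torch.py and requirements/build.txt come
# from the vLLM repo/docs, not from the steps in this answer.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_vllm_from_source() {
  run git clone https://github.com/vllm-project/vllm.git ./vllm || return 1
  run cd ./vllm || return 1
  # Drop the pinned torch requirements so pip keeps the installed torch build.
  run python use_existing_torch.py || return 1
  run pip install --no-cache-dir -r requirements/build.txt || return 1
  # Build in the current environment rather than an isolated one,
  # so the CUDA extension is compiled against the installed torch.
  run pip install --no-cache-dir --no-build-isolation -e . || return 1
}

# Usage:
#   DRY_RUN=1 install_vllm_from_source   # preview the commands
#   install_vllm_from_source             # actually clone and build
```

The dry-run switch is only there so the commands can be reviewed before starting a build that can take a long time.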