First, I’d like to say thank you so much for this excellent LLM software and for the hard work. I currently have an issue: I have an RTX 5090 (Blackwell, sm120) and have been trying to set it up for the past few days with GLM 4.6 and 4.7 flash. I tried building a custom vLLM, but it just doesn’t work due to incompatibility. I would really like to use vLLM over llama.cpp and Ollama, as the performance on vLLM is out of this world — it’s the fastest local AI I’ve ever seen, even compared to third-party AI platforms.
Official support for the RTX 5090 (Blackwell, sm120) has recently landed in vLLM, but pre-built wheels and Docker images may still lack full compatibility due to CUDA architecture flags and PyPI wheel size limits. The recommended workaround is to build vLLM from source with CUDA 12.8+ and the correct TORCH_CUDA_ARCH_LIST (including “12.0 12.1”) for Blackwell GPUs. Several users have confirmed this approach works, and official Docker images with sm120 support are expected in upcoming releases. See the linked issues for detailed build steps and discussion.
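As a rough sketch of the source build described above — the exact file names (e.g. `use_existing_torch.py`, `requirements/build.txt`) and the cu128 nightly index URL are assumptions that vary between vLLM versions, so check the linked issues and the vLLM build docs for your release:

```shell
# Assumes a CUDA 12.8+ toolkit is already installed and on PATH.
# Target Blackwell compute capabilities so kernels are compiled for sm120.
export TORCH_CUDA_ARCH_LIST="12.0 12.1"

# Install a PyTorch nightly built against CUDA 12.8 (index URL is an assumption;
# confirm the current one on pytorch.org).
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

# Build vLLM against the already-installed PyTorch instead of a pinned wheel.
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py           # present in recent vLLM checkouts
pip install -r requirements/build.txt  # path differs in older releases
pip install -e . --no-build-isolation
```

If the build succeeds, `python -c "import vllm"` should complete without CUDA-architecture errors; if compilation is slow, limiting `MAX_JOBS` can keep memory use manageable.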
For GLM 4.6/4.7 and FlashInfer, make sure you use the latest PyTorch nightly (2.9+), vLLM 0.9.2 or newer, and set the appropriate environment variables for FlashInfer and MoE FP4/FP8 support. Example Docker build/run commands and troubleshooting tips are in the linked issues. Would you like a step-by-step guide for building and running vLLM on the RTX 5090 with GLM 4.6/4.7?
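To illustrate the kind of run command the linked issues describe — the image tag `vllm-sm120:latest` and the `<glm-model-id>` placeholder are hypothetical (substitute your locally built image and the actual Hugging Face model ID), while `VLLM_ATTENTION_BACKEND=FLASHINFER` is the standard vLLM switch for the FlashInfer backend:

```shell
# Run a source-built vLLM image on the RTX 5090 with FlashInfer enabled.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  vllm-sm120:latest \
  --model <glm-model-id> \
  --trust-remote-code
```

`--ipc=host` avoids shared-memory errors with multi-worker loading; any additional FP4/FP8 MoE flags from the linked issues would be appended to the server arguments in the same way.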