Does vllm inference work with Qwen3-VL-30B-A3B-Instruct
and 2x 5090?
Not able to get it working; it always fails with:
RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 426.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 397.38 MiB is free.
vLLM inference with Qwen3-VL-30B-A3B-Instruct on 2x RTX 5090 is not currently stable. Users consistently report CUDA out-of-memory (OOM) errors even with reasonable settings for --gpu-memory-utilization, --max-model-len, and --tensor-parallel-size, and there is no confirmed fix in the official vLLM releases as of this writing. Workarounds that have been tried (adjusting memory utilization, disabling chunked prefill, switching Docker images) still end in OOM or illegal-memory-access errors, especially with tensor parallelism on dual 5090s or similar setups. The problem is not unique to your configuration: it is actively discussed in the vLLM community and GitHub issues, and the root cause is likely some combination of model size, memory fragmentation, and incomplete support for Blackwell GPUs in current vLLM and PyTorch builds. See the detailed error logs and user reports in vllm-project/vllm#18814.
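To see why dual 32 GiB cards are marginal for this model, here is a rough bf16 memory estimate (the ~30B total parameter count is assumed from the model name, not measured, and ignores vision-tower overhead, activations, and CUDA graph buffers):

```python
# Back-of-envelope check: weights alone nearly fill each GPU,
# leaving little headroom for KV cache and activations.
params_total = 30e9       # assumed total parameter count (30B, incl. MoE experts)
bytes_per_param = 2       # bf16/fp16
num_gpus = 2              # tensor parallelism splits weights evenly (approx.)
gpu_capacity_gib = 31.36  # per the error message

weights_per_gpu_gib = params_total * bytes_per_param / num_gpus / 2**30
headroom_gib = gpu_capacity_gib - weights_per_gpu_gib
print(f"weights per GPU: {weights_per_gpu_gib:.1f} GiB, "
      f"headroom: {headroom_gib:.1f} GiB")
# → weights per GPU: 27.9 GiB, headroom: 3.4 GiB
```

Roughly 3–4 GiB of headroom per GPU has to cover the KV cache, activations, the vision encoder, and allocator fragmentation, which is why even modest allocation requests (the 426 MiB in your traceback) can tip it over.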
If you need to run this model, try reducing --max-model-len or --max-num-seqs, or use a quantized/FP8 variant if one is available, but success is not guaranteed. Some users report that building vLLM and PyTorch from source with explicit CUDA arch flags for the 5090 (sm_120) and using the latest nightly builds helps, but even then tensor parallelism often fails with OOM or illegal-memory-access errors. For now there is no official fix, and the issue is under investigation by the vLLM team and community. Would you like more detail on possible workarounds or alternative deployment strategies?
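If you want to experiment anyway, a conservative launch might look like the sketch below. The flag values are illustrative starting points, not a verified working configuration for this model on 5090s:

```shell
# Illustrative only: shrink context length and batch size to leave
# KV-cache headroom; --enforce-eager skips CUDA graph capture, which
# also trims memory at some throughput cost.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --enforce-eager
```

If this still OOMs during startup, lowering --max-model-len further is usually the highest-leverage knob, since KV-cache reservation scales with it.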
vLLM supports the AMD Radeon 7900 XTX (gfx1100) for inference, but performance and compatibility are limited compared to the MI300/MI200 series. Models like Qwen3-VL-30B-A3B-Instruct can be run, but expect slow inference, limited quantization support (AWQ is not optimized), and the need to disable FlashAttention or use workarounds (e.g., set VLLM_USE_TRITON_FLASH_ATTN=0). Some users report successful runs with small models or reduced settings, but large models (30B+) are likely to be slow or unstable due to the lack of optimized kernels and lower VRAM. See vllm-project/vllm#4514 and vllm-project/vllm#11249.
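On the 7900 XTX the launch typically needs the Triton FlashAttention path disabled. A sketch, assuming a ROCm build of vLLM on gfx1100 and a quantized checkpoint small enough for 24 GB of VRAM (flag values are illustrative):

```shell
# Assumes a ROCm build of vLLM on gfx1100; values are illustrative.
# VLLM_USE_TRITON_FLASH_ATTN=0 works around FlashAttention kernels
# that are unsupported/unstable on RDNA3.
export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 4 \
  --enforce-eager
```

Note that an unquantized 30B model will not fit on a single 24 GB card, so this path only makes sense with a quantized variant, and even then expect the slowdowns described above.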
In summary, it is possible to run some versions of large models on 7900 XTX with vLLM, but expect slow performance and limited quantization. For best results, use MI300/MI200 series or NVIDIA GPUs. Would you like more detail on specific configuration steps or model recommendations?