Does vllm inference work with Qwen3-VL-30B-A3B-Instruct
and 2x 5090?
Not able to get it working; it always fails with:
RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 426.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 397.38 MiB is free.
vLLM inference with Qwen3-VL-30B-A3B-Instruct on 2x RTX 5090 is not currently stable. Users consistently report CUDA out-of-memory (OOM) errors even with reasonable settings for --gpu-memory-utilization, --max-model-len, and --tensor-parallel-size, and there is no confirmed fix in the official vLLM releases as of this writing. Workarounds that have been tried (adjusting memory utilization, disabling chunked prefill, switching Docker images) still end in OOM or illegal-memory-access errors, especially with tensor parallelism on dual 5090s or similar setups. The problem is not unique to your configuration: it is actively discussed in the vLLM community and GitHub issues, and the root cause is likely some combination of model size, memory fragmentation, and incomplete support for Blackwell GPUs in current vLLM and PyTorch builds. See the detailed error logs and user reports in vllm-project/vllm#18814.
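To see why dual 32 GiB cards are marginal for this model, here is a rough bf16 memory estimate (the ~30B total parameter count is assumed from the model name, not measured, and ignores vision-tower overhead, activations, and CUDA graph buffers):

```python
# Back-of-envelope check: weights alone nearly fill each GPU,
# leaving little headroom for KV cache and activations.
params_total = 30e9       # assumed total parameter count (30B, incl. MoE experts)
bytes_per_param = 2       # bf16/fp16
num_gpus = 2              # tensor parallelism splits weights evenly (approx.)
gpu_capacity_gib = 31.36  # per the error message

weights_per_gpu_gib = params_total * bytes_per_param / num_gpus / 2**30
headroom_gib = gpu_capacity_gib - weights_per_gpu_gib
print(f"weights per GPU: {weights_per_gpu_gib:.1f} GiB, "
      f"headroom: {headroom_gib:.1f} GiB")
# → weights per GPU: 27.9 GiB, headroom: 3.4 GiB
```

Roughly 3–4 GiB of headroom per GPU has to cover the KV cache, activations, the vision encoder, and allocator fragmentation, which is why even modest allocation requests (the 426 MiB in your traceback) can tip it over.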
If you need to run this model, try reducing --max-model-len or --max-num-seqs, or use a quantized/FP8 variant if one is available, but success is not guaranteed. Some users report that building vLLM and PyTorch from source with explicit CUDA arch flags for the 5090 (sm_120) and using the latest nightly builds helps, but even then tensor parallelism often fails with OOM or illegal-memory-access errors. For now there is no official fix, and the issue is under investigation by the vLLM team and community. Would you like more detail on possible workarounds or alternative deployment strategies?
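If you want to experiment anyway, a conservative launch might look like the sketch below. The flag values are illustrative starting points, not a verified working configuration for this model on 5090s:

```shell
# Illustrative only: shrink context length and batch size to leave
# KV-cache headroom; --enforce-eager skips CUDA graph capture, which
# also trims memory at some throughput cost.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --enforce-eager
```

If this still OOMs during startup, lowering --max-model-len further is usually the highest-leverage knob, since KV-cache reservation scales with it.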
vLLM supports the AMD Radeon 7900 XTX (gfx1100) for inference, but performance and compatibility are limited compared to the MI300/MI200 series. Models like Qwen3-VL-30B-A3B-Instruct can be run, but expect slow inference, limited quantization support (AWQ is not optimized), and the need to disable FlashAttention or use workarounds (e.g., set VLLM_USE_TRITON_FLASH_ATTN=0). Some users report successful runs with small models or reduced settings, but large models (30B+) are likely to be slow or unstable due to the lack of optimized kernels and lower VRAM. See vllm-project/vllm#4514 and vllm-project/vllm#11249.
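On the 7900 XTX the launch typically needs the Triton FlashAttention path disabled. A sketch, assuming a ROCm build of vLLM on gfx1100 and a quantized checkpoint small enough for 24 GB of VRAM (flag values are illustrative):

```shell
# Assumes a ROCm build of vLLM on gfx1100; values are illustrative.
# VLLM_USE_TRITON_FLASH_ATTN=0 works around FlashAttention kernels
# that are unsupported/unstable on RDNA3.
export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 4 \
  --enforce-eager
```

Note that an unquantized 30B model will not fit on a single 24 GB card, so this path only makes sense with a quantized variant, and even then expect the slowdowns described above.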
In summary, it is possible to run some versions of large models on 7900 XTX with vLLM, but expect slow performance and limited quantization. For best results, use MI300/MI200 series or NVIDIA GPUs. Would you like more detail on specific configuration steps or model recommendations?