Hey people, I'm setting up a vLLM server on a machine with 4 GPUs (A6000), and I'm getting some bizarre results while benchmarking the average request time for each model size.
My NVIDIA driver is 535.65, and the CUDA version from nvcc --version is 11.5.
The 30B model is split across 2 GPUs with tensor-parallel size 2.
I have a test script that averages the response time over 20 images for each of the models, and the results are very unintuitive to me. Does anyone know what the heck is going on?
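For context, here is a minimal sketch of the kind of averaging benchmark I mean, assuming an OpenAI-compatible vLLM server on localhost:8000; the model name, image paths, and prompt are placeholders, not my actual script:

```python
# Rough sketch of an average-latency benchmark against a vLLM OpenAI-compatible
# server. Model name, image paths, and prompt are placeholders.
import base64
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "<served-model-name>"  # placeholder
IMAGE_PATHS = [f"images/{i:02d}.jpg" for i in range(20)]  # hypothetical test set

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

latencies = []
for path in IMAGE_PATHS:
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
            ],
        }],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"avg request time: {sum(latencies) / len(latencies):.2f}s over {len(latencies)} images")
```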
Your results—where the 30B model is faster than the 8B and 4B models—are counterintuitive but can be explained by GPU utilization and parallelism. When you run the 30B model with tensor parallelism across 2 GPUs, you effectively double the compute and memory bandwidth available for that model, allowing it to process requests faster than the smaller models running on a single GPU. The 8B and 4B models, each on a single GPU, may be bottlenecked by compute or memory, especially if the A6000 is not fully saturated or if the batch size is small (as with --max_num_seqs 1). This effect is documented in vLLM discussions, where higher tensor parallelism can yield better throughput and lower latency for large models, while small models may not benefit and can even be slower due to underutilization or lack of parallelism (source, source).
Additionally, vLLM’s scheduling and batching are optimized for high throughput with larger models and higher parallelism, so single-sequence, single-GPU runs may not reflect optimal performance. For small models, running multiple concurrent requests or using data parallelism (multiple replicas) is recommended to better utilize the hardware (source). Would you like a step-by-step breakdown or tuning suggestions for your setup?
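As a rough illustration of the concurrency point, here is a sketch that compares sequential requests (roughly what a --max_num_seqs 1 style benchmark measures) with several concurrent requests against the same endpoint, which lets vLLM's continuous batching fill the GPU; the endpoint, model name, and prompts are placeholders:

```python
# Sketch: sequential vs. concurrent request timing against one vLLM endpoint.
# Endpoint, model name, and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "<served-model-name>"  # placeholder

def one_request(prompt: str) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - start

prompts = [f"Describe test case {i}." for i in range(20)]

# Sequential: one request at a time, similar to a single-sequence benchmark
t0 = time.perf_counter()
for p in prompts:
    one_request(p)
sequential = time.perf_counter() - t0

# Concurrent: multiple in-flight requests that the scheduler can batch together
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(one_request, prompts))
concurrent = time.perf_counter() - t0

print(f"sequential: {sequential:.1f}s  concurrent (8 workers): {concurrent:.1f}s")
```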
Yep yep, just figured it out by looking at the configs, didn't know that -A3B stands for 3B active parameters in an MoE. Very impressive that they only need 10% of the parameters!!
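For anyone else hitting this, a quick sketch to sanity-check the MoE setup from the Hugging Face config; the num_experts / num_experts_per_tok field names are assumptions based on Qwen-style MoE configs and may differ for your checkpoint, and the checkpoint path is a placeholder:

```python
# Sanity-check the MoE routing setup from the model config. Field names are
# assumptions (common in Qwen-style MoE configs); checkpoint path is a placeholder.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("<your-30B-A3B-checkpoint>")
experts_total = getattr(cfg, "num_experts", None)
experts_active = getattr(cfg, "num_experts_per_tok", None)
print(f"experts per layer: {experts_total}, active per token: {experts_active}")

# Back-of-the-envelope check of the "A3B" naming: ~3B active out of ~30B total
print(f"active parameter fraction: {3e9 / 30e9:.0%}")  # ~10%
```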