Weird benchmarking results regarding Qwen3-VL 30B/8B/4B [solved: MoE]

Hey people, I'm setting up vLLM servers on a machine with 4 GPUs (RTX A6000), and I'm running into some bizarre results while benchmarking the average request time for each model size.

My NVIDIA driver is 535.65, and the CUDA version from nvcc --version is 11.5.

The 30B model is placed on 2 GPUs using tensor parallelism (--tensor-parallel-size 2):

CUDA_VISIBLE_DEVICES=2,3 vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct --tensor-parallel-size 2 --limit-mm-per-prompt.video 0 --async-scheduling --mm-processor-cache-gb 0 --max-model-len 8000 --max_num_seqs 1 --gpu-memory-utilization 0.95
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-VL-8B-Instruct --limit-mm-per-prompt.video 0 --async-scheduling --mm-processor-cache-gb 0 --max-model-len 8000 --max_num_seqs 1 --gpu-memory-utilization 0.95 --port 8001
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-VL-4B-Instruct --limit-mm-per-prompt.video 0 --async-scheduling --mm-processor-cache-gb 0 --max-model-len 8000 --max_num_seqs 1 --gpu-memory-utilization 0.95 --port 8002

I have a test script that averages the response time over ~20 images for each of the models, and the results are very unintuitive to me. Does anyone know what the heck is going on?
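Roughly, the script is just a sequential timing loop against each server's OpenAI-compatible endpoint; here is a simplified sketch (image paths, prompt, and response handling are placeholders, not the exact script):

import base64, statistics, time, requests

# Minimal sketch of the benchmark loop: send one image per request to the
# OpenAI-compatible endpoint of each server and time the round trip.
IMAGES = [f"img_{i}.jpg" for i in range(20)]
SERVERS = {
    "Qwen/Qwen3-VL-30B-A3B-Instruct": "http://localhost:8000/v1/chat/completions",
    "Qwen/Qwen3-VL-8B-Instruct": "http://localhost:8001/v1/chat/completions",
    "Qwen/Qwen3-VL-4B-Instruct": "http://localhost:8002/v1/chat/completions",
}

def ask(url, model, image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "max_tokens": 64,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    t0 = time.perf_counter()
    requests.post(url, json=payload, timeout=120).raise_for_status()
    return time.perf_counter() - t0

for model, url in SERVERS.items():
    times = [ask(url, model, p) for p in IMAGES]
    print(model, f"avg={statistics.mean(times):.2f}s median={statistics.median(times):.2f}s "
                 f"min={min(times):.2f}s max={max(times):.2f}s")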

— Performance Statistics —

Model: Qwen/Qwen3-VL-30B-A3B-Instruct
Average: 0.47s
Min: 0.41s
Max: 0.58s
Median: 0.46s

Model: Qwen/Qwen3-VL-8B-Instruct
Average: 1.04s
Min: 0.92s
Max: 1.29s
Median: 1.03s

Model: Qwen/Qwen3-VL-4B-Instruct
Average: 0.60s
Min: 0.55s
Max: 0.77s
Median: 0.58s

Your results, where the 30B model is faster than the 8B and 4B models, are counterintuitive but can be explained by GPU utilization and parallelism. When you run the 30B model with tensor parallelism across 2 GPUs, you effectively double the compute and memory bandwidth available to that model, allowing it to process requests faster than the smaller models running on a single GPU. The 8B and 4B models, each on a single GPU, may be bottlenecked by compute or memory bandwidth, especially at small batch sizes (as with --max_num_seqs 1). This effect has come up in vLLM discussions: higher tensor parallelism can yield better throughput and lower latency for large models, while small models may not benefit and can even be slower due to underutilization or lack of parallelism.

Additionally, vLLM's scheduling and batching are optimized for high throughput with larger batches and higher parallelism, so single-sequence, single-GPU runs may not reflect peak performance. For small models, running multiple concurrent requests or using data parallelism (multiple replicas) is the usual recommendation to better utilize the hardware; see the sketch below. Would you like a step-by-step breakdown or tuning suggestions for your setup?
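For example, a minimal concurrency sketch (the URL, model name, and payload are placeholders; adjust to your setup):

import concurrent.futures, time, requests

# Sketch: fire several requests at one server at the same time instead of
# one by one. Note that with --max_num_seqs 1 the server still processes
# them sequentially, so raise that limit to actually batch.
URL = "http://localhost:8001/v1/chat/completions"
PAYLOAD = {
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Describe the weather in one sentence."}],
}

def one_request(_):
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=120).raise_for_status()
    return time.perf_counter() - t0

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    latencies = sorted(pool.map(one_request, range(8)))
print("per-request latencies (s):", [round(t, 2) for t in latencies])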


OK @RunLLM, I tried placing the 8B model on 2 GPUs with --tensor-parallel-size 2 as well, and the results still favor the 30B one. Any ideas?

CUDA_VISIBLE_DEVICES=2,3 vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct --tensor-parallel-size 2 --limit-mm-per-prompt.video 0 --async-scheduling --mm-processor-cache-gb 0 --max-model-len 8000 --max_num_seqs 1 --gpu-memory-utilization 0.95

CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-VL-8B-Instruct --tensor-parallel-size 2 --limit-mm-per-prompt.video 0 --async-scheduling --mm-processor-cache-gb 0 --max-model-len 8000 --max_num_seqs 1 --gpu-memory-utilization 0.95 --port 8001

— Performance Statistics —

Model: Qwen/Qwen3-VL-30B-A3B-Instruct
Average: 0.40s
Min: 0.37s
Max: 0.53s
Median: 0.39s

Model: Qwen/Qwen3-VL-8B-Instruct
Average: 0.91s
Min: 0.54s
Max: 6.57s
Median: 0.62s

What do you think about these results? @RunLLM

The 30B model is a Mixture-of-Experts model with only 3B activated parameters per token, so it makes sense that during inference it performs roughly like a 3B model.
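A rough back-of-envelope makes the gap plausible (assumptions: bf16 weights, decode time dominated by streaming the weights from GPU memory, vision encoder and prefill ignored):

# Per-token decode floor when limited by weight streaming: bf16 = 2 bytes/param,
# RTX A6000 memory bandwidth is roughly 768 GB/s, and the MoE model only reads
# the ~3B activated parameters per token.
BANDWIDTH = 768e9  # bytes/s per GPU
for name, active_params in [("4B dense", 4e9), ("8B dense", 8e9), ("30B-A3B MoE", 3e9)]:
    ms_per_token = active_params * 2 / BANDWIDTH * 1e3
    print(f"{name}: ~{ms_per_token:.0f} ms/token lower bound on one A6000")

And with --tensor-parallel-size 2 the weight reads are split across two GPUs, which lowers the floor further for the 30B setup.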


Yep, just figured it out looking at the configs; I didn't know that "-A3B" stands for 3B active parameters in an MoE model. Very impressive that they only need ~10% of the parameters per token!!
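In case anyone else wants to check: a quick sketch for pulling a model's config and printing the expert-related fields (exact key names vary between model families, so this just scans for "expert"):

import json
from huggingface_hub import hf_hub_download

# Sketch: download config.json from the Hub and print any field whose name
# mentions "expert" (MoE settings may live in nested sub-configs for VL models).
path = hf_hub_download("Qwen/Qwen3-VL-30B-A3B-Instruct", "config.json")
with open(path) as f:
    config = json.load(f)

def print_expert_fields(obj, prefix=""):
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "expert" in key.lower():
                print(f"{prefix}{key} = {value}")
            print_expert_fields(value, f"{prefix}{key}.")

print_expert_fields(config)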

thx anyways!