I am trying to launch two models, Qwen-32B-VL and Molmo-7B-D. I was able to get a generation throughput of approximately 1000 tokens/sec per vLLM instance for the Qwen 32B model, but when I switch to the 7B Molmo-7B-D model, the generation throughput drops to approximately 150 tokens/sec per vLLM instance. Any idea how to get high generation throughput (say, more than 1k tokens/sec) on the 7B model, given that prompt throughput is high for both of these models?
The low generation throughput (~150 tokens/sec) for Molmo-7B-D compared to Qwen-32B-VL (~1000 tokens/sec) is not expected on the basis of model size alone; a 7B model should normally decode faster than a 32B model. However, there are known performance issues with Molmo-7B-D in vLLM, and this model may not be as well optimized as others. According to this issue, users have reported slow generation speeds for Molmo-7B-D even on high-end GPUs, and there is no official fix or optimization guidance specific to this model in the vLLM documentation.
If prompt throughput is high but generation throughput is low, the cause is likely model-specific: inefficient custom layers or missing kernel optimizations for Molmo-7B-D. The provided data sources list no vLLM configuration flags or parameters that specifically address this performance gap for Molmo-7B-D. For other models, increasing the batch size, tuning max_num_batched_tokens, and using an optimized attention backend can help (a sketch of these knobs follows below), but these are general tips and may not overcome model-specific bottlenecks. Would you like a step-by-step breakdown of general vLLM throughput tuning strategies?
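As a rough starting point, here is a minimal sketch of those general knobs using vLLM's offline Python API. The model id (allenai/Molmo-7B-D-0924) and all numeric values are assumptions for illustration; tune them to your GPU memory and workload, and they may not help if the bottleneck is inside Molmo's own model code.

```python
from vllm import LLM, SamplingParams

# Minimal sketch with illustrative values, not a verified Molmo-7B-D recipe.
llm = LLM(
    model="allenai/Molmo-7B-D-0924",  # assumed HF checkpoint id
    trust_remote_code=True,           # Molmo ships custom model code
    dtype="bfloat16",
    gpu_memory_utilization=0.90,      # maximize space left for the KV cache
    max_model_len=4096,               # shorter context -> more sequences fit in the cache
    max_num_seqs=256,                 # cap on concurrently decoded sequences
    max_num_batched_tokens=8192,      # token budget per scheduler step
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Describe the weather in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```

The same knobs are exposed on the server as --max-model-len, --max-num-seqs, --max-num-batched-tokens, and --gpu-memory-utilization, and you can experiment with the VLLM_ATTENTION_BACKEND environment variable to switch attention backends, but none of this is guaranteed to close the gap for Molmo-7B-D specifically.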
Sources: