I am trying to launch two models, Qwen-32B-VL and Molmo-7B-D. I was able to get a generation throughput of approximately 1000 tokens/sec per vLLM instance for the Qwen 32B model, but when I switch to the 7B Molmo-7B-D model, the generation throughput drops to approximately 150 tokens/sec per vLLM instance. Any idea how to get high generation throughput (say, more than 1k tokens/sec) on the 7B model, given that prompt throughput is high for both of these models?
The low generation throughput (~150 tokens/sec) for Molmo-7B-D compared to Qwen-32B-VL (~1000 tokens/sec) is not expected on the basis of model size alone; a 7B model should normally decode faster than a 32B model. However, there are known performance issues with Molmo-7B-D in vLLM, and this model may not be as well optimized as others. According to this issue, users have reported slow generation speeds for Molmo-7B-D even on high-end GPUs, and there is no official fix or optimization guidance specific to this model in the vLLM documentation.
If prompt throughput is high but generation throughput is low, the cause is likely model-specific: inefficient custom layers or missing kernel optimizations for Molmo-7B-D. The provided data sources list no vLLM configuration flags or parameters that specifically address this performance gap for Molmo-7B-D. For other models, increasing the batch size, tuning max_num_batched_tokens, and using an optimized attention backend can help (a sketch of these knobs follows below), but these are general tips and may not overcome model-specific bottlenecks. Would you like a step-by-step breakdown of general vLLM throughput tuning strategies?
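As a rough starting point, here is a minimal sketch of those general knobs using vLLM's offline Python API. The model id (allenai/Molmo-7B-D-0924) and all numeric values are assumptions for illustration; tune them to your GPU memory and workload, and they may not help if the bottleneck is inside Molmo's own model code.

```python
from vllm import LLM, SamplingParams

# Minimal sketch with illustrative values, not a verified Molmo-7B-D recipe.
llm = LLM(
    model="allenai/Molmo-7B-D-0924",  # assumed HF checkpoint id
    trust_remote_code=True,           # Molmo ships custom model code
    dtype="bfloat16",
    gpu_memory_utilization=0.90,      # maximize space left for the KV cache
    max_model_len=4096,               # shorter context -> more sequences fit in the cache
    max_num_seqs=256,                 # cap on concurrently decoded sequences
    max_num_batched_tokens=8192,      # token budget per scheduler step
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Describe the weather in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```

The same knobs are exposed on the server as --max-model-len, --max-num-seqs, --max-num-batched-tokens, and --gpu-memory-utilization, and you can experiment with the VLLM_ATTENTION_BACKEND environment variable to switch attention backends, but none of this is guaranteed to close the gap for Molmo-7B-D specifically.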
Sources: