Problem with Qwen2.5-VL-7b

To improve throughput for Qwen/Qwen2.5-VL-7B-Instruct on an A100, increase max_num_batched_tokens (e.g., above 8192 on large GPUs) and consider raising max_num_seqs if memory allows. Avoid --enforce-eager unless necessary: it disables CUDA graphs, which can reduce performance. For multi-GPU setups, set tensor_parallel_size to shard the model across GPUs, which leaves more memory per GPU for the KV cache and can further boost throughput. Example:

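from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_num_batched_tokens=16384,  # token budget per scheduler step
    max_num_seqs=64,               # max sequences batched together; lower if you run out of memory
    tensor_parallel_size=4,        # only if using 4 GPUs; omit on a single GPU
)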

See Performance Tuning with Chunked Prefill and Parallelism Strategies for more details.
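
To check whether a given setting actually helps on your hardware, it is worth measuring generated tokens per second before and after each change. Below is a minimal sketch of one way to do that. The helper names (tokens_per_second, measure, run_benchmark), the prompt text, and the batch size of 64 are illustrative assumptions, not part of vLLM; the only vLLM calls used are LLM, SamplingParams, and llm.generate.

```python
import time


def tokens_per_second(outputs, elapsed_s):
    """Generated tokens per second, summed over every completion in the batch."""
    total_tokens = sum(
        len(completion.token_ids)
        for request_output in outputs
        for completion in request_output.outputs
    )
    return total_tokens / elapsed_s


def measure(llm, prompts, sampling_params):
    """Time a single llm.generate() call and return its output throughput."""
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed_s = time.perf_counter() - start
    return tokens_per_second(outputs, elapsed_s)


def run_benchmark():
    """Hypothetical driver: requires vLLM and a GPU; call it manually."""
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        max_num_batched_tokens=16384,
        max_num_seqs=64,
    )
    # Text-only prompts chosen for illustration; use your real workload instead.
    prompts = ["Summarize the benefits of request batching."] * 64
    params = SamplingParams(max_tokens=128)
    print(f"{measure(llm, prompts, params):.1f} generated tokens/s")
```

Run run_benchmark() once per configuration and change one setting at a time, so any difference can be attributed to that setting. The first call includes warm-up cost, so a second timed run usually gives a more representative number.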

Would you like a step-by-step breakdown or more tuning suggestions?
