To improve throughput for Qwen/Qwen2.5-VL-7B-Instruct on an A100, increase max_num_batched_tokens (e.g., above 8192 on large GPUs) and consider raising max_num_seqs if memory allows. Avoid --enforce-eager unless you need it: it disables CUDA graphs, which can reduce performance. On multi-GPU setups, set tensor_parallel_size to shard the model across GPUs, which may further improve throughput. Example:
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_num_batched_tokens=16384,
    max_num_seqs=64,
    tensor_parallel_size=4,  # if using 4 GPUs
)
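If you switch between single-GPU and multi-GPU runs, it can help to collect these settings in one place. The following is a minimal sketch, not part of vLLM: `build_engine_kwargs` is a hypothetical helper name, and the numeric values are simply the ones from the example above, so treat them as starting points to tune rather than recommendations.

```python
from typing import Any, Dict


def build_engine_kwargs(num_gpus: int = 1, enforce_eager: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: gather the throughput-related settings into one dict."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    kwargs: Dict[str, Any] = {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        # Values taken from the example above; tune them for your workload.
        "max_num_batched_tokens": 16384,
        "max_num_seqs": 64,
        # Shard the model across all available GPUs.
        "tensor_parallel_size": num_gpus,
    }

    # Only pass enforce_eager when explicitly requested, since it
    # disables CUDA graphs and can reduce performance.
    if enforce_eager:
        kwargs["enforce_eager"] = True

    return kwargs


# Usage (requires vLLM and a GPU, so it is left commented out here):
# from vllm import LLM
# llm = LLM(**build_engine_kwargs(num_gpus=4))
```

Leaving enforce_eager out of the dict by default keeps vLLM's own default behavior, so CUDA graphs stay enabled unless you opt out.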
See "Performance Tuning with Chunked Prefill" and "Parallelism Strategies" in the vLLM documentation for more details.
Would you like a step-by-step breakdown or more tuning suggestions?