Problem with Qwen2.5-VL-7b

Hello, I'm serving Qwen/Qwen2.5-VL-7b and getting only 10 t/s on A100 GPUs. How can I make it faster?

python -m vllm.entrypoints.openai.api_server \
    --model /workspace/models/qwen2.5-vl-7b \
    --served-model-name qwen2.5-vl-7b \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype bfloat16 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 2048 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --limit-mm-per-prompt image=2 \
    --enable-prefix-caching

To improve throughput for Qwen/Qwen2.5-VL-7B on an A100, increase max_num_batched_tokens (e.g., above 8096 on large GPUs) and consider raising max_num_seqs if memory allows. Also avoid --enforce-eager unless you need it: it disables CUDA graphs, which hurts decode performance. For multi-GPU setups, use tensor_parallel_size to shard the model and boost throughput further. Example:

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_num_batched_tokens=16384,
    max_num_seqs=64,
    tensor_parallel_size=4  # if using 4 GPUs
)
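
Since you're launching via the OpenAI-compatible server, the same settings map to the --max-num-batched-tokens, --max-num-seqs, and --tensor-parallel-size flags on the command line, and simply dropping --enforce-eager re-enables CUDA graph capture.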

See Performance Tuning with Chunked Prefill and Parallelism Strategies for more details.
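
As a minimal sketch (assuming a vLLM version where enable_chunked_prefill is accepted as an engine argument; recent releases may already enable it by default), chunked prefill can be turned on explicitly, with max_num_batched_tokens acting as the per-step token budget:

from vllm import LLM

# Sketch: chunked prefill splits long prompt prefills into chunks of up to
# max_num_batched_tokens, so decode steps can be batched alongside prefill.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)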

Would you like a step-by-step breakdown or more tuning suggestions?

Sources:

vLLM Performance Issue with Qwen2.5-VL-7B-Instruct

Setup

  • Platform: Vast.ai A100 80GB
  • Model: Qwen/Qwen2.5-VL-7B-Instruct (~15.6GB VRAM usage)
  • vLLM Version: 0.9.2
  • Backend: Flash Attention (Flash-attn working, but FlashInfer not available)

Current Performance Issue

Extremely low throughput: Only getting 1.0-2.0 tokens/s generation speed on A100 80GB, which seems unusually slow for a 7B model.

Current Configuration

vllm serve /workspace/models/qwen2.5-vl-7b \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 16384 \
    --served-model-name qwen2.5-vl-7b

Key Observations

  • Model loads successfully: 15.6GB VRAM usage, loads in ~5 seconds
  • CUDA graphs enabled: Graph capturing completes successfully (23 secs)
  • Low GPU utilization: KV cache usage shows 0.0% most of the time
  • Warning present: “sequence length (4096) is smaller than pre-defined worst-case multimodal tokens (32768)”
  • FlashInfer unavailable: Falling back to PyTorch-native implementation

Previous Attempts

  1. Tried aggressive batching settings (16384 batched tokens, 64 seqs) - worse performance
  2. Tried enforce-eager mode - got ~10 t/s but still suboptimal
  3. Tried lower memory utilization (0.85, 0.90) - similar results
  4. Tried smaller batch sizes - marginal improvement to ~10 t/s

Hardware Details

  • GPU: A100 80GB
  • Available KV cache: 19.27 GiB
  • Maximum concurrency: 88.09x for 4096 token requests
  • CUDA graphs: Working properly

Looking for advice on whether this performance is expected or if there are specific optimizations for Qwen2.5-VL models in vLLM.

Throughput of 1–2 tokens/s on an A100 80GB for Qwen2.5-VL-7B-Instruct is much lower than expected. This is likely due to the model’s multimodal (vision-language) nature: VLMs are significantly slower than pure text models, especially when the input includes images or the multimodal processor is not optimized. The warning that the “sequence length (4096) is smaller than pre-defined worst-case multimodal tokens (32768)” also suggests the model is reserving excessive memory for multimodal processing, which can further reduce performance. FlashInfer is not available for VLMs, so you are limited to the PyTorch or FlashAttention backends, which are slower for vision tasks.

To improve performance, try reducing the multimodal processor’s memory footprint by setting a lower max_pixels in mm_processor_kwargs (e.g., max_pixels=768*768) and limiting the number of images per prompt. If you only need text, set limit_mm_per_prompt={"image": 0}. Throughput for VLMs is inherently lower than for text-only models, but you should still expect at least 10–20 t/s for text-only prompts on an A100. See Reducing Memory Usage and Multimodal Input Limits for details.
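
A minimal sketch of these settings with the offline LLM API (the equivalent --mm-processor-kwargs and --limit-mm-per-prompt flags exist for vllm serve); the max_pixels value and the limit of one image per prompt are illustrative, not required:

from vllm import LLM

# Sketch: cap the image resolution seen by the Qwen2.5-VL processor and limit
# images per prompt; use {"image": 0} instead for text-only serving.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_model_len=4096,
    mm_processor_kwargs={"max_pixels": 768 * 768},  # lowers the worst-case image token count
    limit_mm_per_prompt={"image": 1},
)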

Would you like a step-by-step guide to applying these settings?

Sources: