Speeding up vLLM inference for Qwen2.5-VL

Looking for suggestions or flags that can help speed up inference for extracting info from an image.
I’m currently using Qwen2.5-VL-7B on A100 GPUs.

vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
 --port 6161 --host 0.0.0.0 \
 --limit-mm-per-prompt image=1 \
 --max_model_len=5000 \
 --enable-chunked-prefill \
 --dtype bfloat16

and I’m calling it with a prompt like
Extract the key-value information from the image in one line in this format {"name": "xx", "birthday": "xx", "issued_date": "xx", "expiry_date": "xx", "id_number": "xx", "birthplace": "xx", "nationality": "xx"}

I’m getting
Avg prompt throughput: 764.0 tokens/s, Avg generation throughput: 132.4 tokens/s (does this token speed look normal?)
and each request takes about 2s for the whole response to be completed. Ideally, I need the whole response in under 1s…

I’m wondering about the following:

  1. Are there flags I can try to speed up inference? (I’ve tried quantized models, but the accuracy is lower.)
  2. Instead of trying to get one long JSON answer, could I ask 7 different questions in parallel about the same image? So, instead of asking it to return a JSON, we ask in parallel “what is the name”, “what is the birthday”, and so on, so that 7 different answers are generated in parallel. Could this be faster?

Your current throughput (764 prompt tokens/s, 132.4 generation tokens/s) is within the expected range for Qwen2.5-VL-7B on A100, but total latency is also affected by image preprocessing, which can be a bottleneck for Qwen2-VL models. To further optimize, you can try reducing image size via the mm_processor_kwargs (e.g., set "max_pixels" lower), and adjust max_model_len and max_num_seqs to fit more requests in parallel if memory allows. Also, ensure you are using the latest vLLM version, as some serialization and preprocessing speedups have been added recently. See docs and issue.
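
For example, a sketch of the serve command with a lower pixel budget (the max_pixels value here is an assumption to illustrate the flag; tune it against your extraction accuracy, and check that your vLLM version exposes --mm-processor-kwargs):

vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
 --port 6161 --host 0.0.0.0 \
 --limit-mm-per-prompt image=1 \
 --max_model_len=5000 \
 --mm-processor-kwargs '{"max_pixels": 1003520}' \
 --dtype bfloat16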

Splitting your prompt into 7 parallel requests (one per field) will not be faster overall, since each request will still require image preprocessing and model inference, and the total compute will likely be higher. Batch inference (multiple prompts in a single call) is supported, but for multi-modal models, each batch still processes the image for each prompt, so you won’t see a significant speedup unless you can reuse the preprocessed image, which is not natively supported. See multimodal input docs and discussion.
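
If you want to measure the two approaches yourself, here is a minimal sketch against the OpenAI-compatible endpoint (the port, model name, and card.jpg file are assumptions taken from your setup; it times one JSON request and then 7 field-level requests sent concurrently):

import asyncio
import base64
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:6161/v1", api_key="EMPTY")

async def ask(image_b64, question):
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": question},
        ]}],
        temperature=0, max_tokens=128,
    )
    return resp.choices[0].message.content

async def main():
    image_b64 = base64.b64encode(open("card.jpg", "rb").read()).decode()
    t0 = time.perf_counter()
    # One request returning the whole JSON
    await ask(image_b64, "Extract the key-value information from the image as JSON.")
    t1 = time.perf_counter()
    # Seven field-level requests in parallel (the image is re-processed for each one)
    fields = ["name", "birthday", "issued_date", "expiry_date", "id_number", "birthplace", "nationality"]
    await asyncio.gather(*(ask(image_b64, f"What is the {f}?") for f in fields))
    t2 = time.perf_counter()
    print(f"single JSON request: {t1 - t0:.2f}s, 7 parallel requests: {t2 - t1:.2f}s")

asyncio.run(main())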

Would you like more detail on specific flags or profiling image preprocessing?

Which quantized models have you tried?
Have you tried:

  1. RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8 · Hugging Face
  2. RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16 · Hugging Face

Both Qwen2.5-VL-7B-Instruct-quantized.w8a8 (INT8) and Qwen2.5-VL-7B-Instruct-quantized.w4a16 (INT4) are officially released quantized versions, optimized for vLLM and can be used directly with vLLM ≥ 0.5.2. They are designed for efficient inference and reduced memory usage, but some accuracy drop is expected compared to FP16/BF16 models, especially with INT4.
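
For example, assuming the INT8 checkpoint and otherwise the same flags as your current command (the quantization scheme is read from the checkpoint config, so no extra --quantization flag should be needed):

vllm serve RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8 \
 --port 6161 --host 0.0.0.0 \
 --limit-mm-per-prompt image=1 \
 --max_model_len=5000 \
 --dtype bfloat16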

Would you like more detail on how to use these models with vLLM or on their performance tradeoffs?

I haven’t tried those, I’ve tried Qwen/Qwen2.5-VL-7B-Instruct-AWQ · Hugging Face.
End goal is to fine-tune it and maybe quantize it (I’m using GitHub - hiyouga/LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)).

The Qwen/Qwen2.5-VL-7B-Instruct-AWQ model uses AWQ 4-bit quantization, which is supported by vLLM for efficient inference. For your workflow—fine-tuning and then quantizing—it’s generally recommended to fine-tune the model in full precision (FP16/BF16), then quantize the resulting checkpoint for best accuracy. LLaMA-Factory supports LoRA/QLoRA fine-tuning, but for quantization, you should use tools like llm-compressor or AutoAWQ after fine-tuning, as described in the vLLM quantization docs.
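
A minimal sketch of the quantize-after-fine-tune step using AutoAWQ (the paths and quant config below are assumptions, and AWQ calibration of a vision-language model may need extra handling beyond this text-style recipe, so treat it as a starting point rather than a tested pipeline):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

finetuned_path = "path/to/your-finetuned-qwen2.5-vl-7b"  # hypothetical LLaMA-Factory output
output_path = "qwen2.5-vl-7b-finetuned-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(finetuned_path)
tokenizer = AutoTokenizer.from_pretrained(finetuned_path, trust_remote_code=True)

# Runs AWQ calibration and rewrites the linear layers with 4-bit weights
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)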

Would you like step-by-step guidance on the fine-tune-then-quantize workflow for Qwen2.5-VL-7B?

How does FlashInfer compare to FlashAttention? Should I use both, etc.?

FlashInfer and FlashAttention are both optimized attention backends in vLLM. FlashAttention is the default for most models and is highly optimized for NVIDIA GPUs, while FlashInfer is designed for faster decoding and supports FP8 quantization and GQA, but it does not outperform FlashAttention in every scenario. You can only use one backend at a time, set via the VLLM_ATTENTION_BACKEND environment variable. Recent benchmarks show similar performance between the two on A100/H100, with FlashInfer sometimes slightly faster for large batch sizes or specific quantized models, but not universally so (benchmark, discussion).

If you use FP8 quantization or need GQA support, FlashInfer is recommended; otherwise, FlashAttention is a strong default. You cannot use both simultaneously, but you can benchmark both for your workload by switching the backend.
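
For example, to benchmark FlashInfer against your existing setup (assuming the flashinfer package is installed so the backend is available; the other flags mirror your current command):

VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
 --port 6161 --host 0.0.0.0 \
 --limit-mm-per-prompt image=1 \
 --max_model_len=5000 \
 --dtype bfloat16

Switch back with VLLM_ATTENTION_BACKEND=FLASH_ATTN and compare the two runs on your own traffic.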

Would you like more detail on configuration or performance benchmarks?

  1. What would be the recommended top_p, top_k, and temperature values? My task is mainly OCR extraction from an image, and I want deterministic results.
  2. I sometimes get an extraction that doesn’t terminate, e.g. “no_registro”: "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000". Is there any setting to terminate generation if the same token has been repeated N times?

Example prompt that vllm sees
‘<|im_start|>system\nYou are a helpful AI that extracts structured data.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Extract the key-value information from the image in one line in this format {“cpf”: “xx”, “birthday”: “xx”, “name”: “xx”, “id_number”: “xx”, “expiry_date”: “xx”, “father_name”: “xx”, “mother_name”: “xx”, “no_registro”: “xx”, “first_licence_date”: “xx”, “cnh_category”: “xx”}<|im_end|>\n<|im_start|>assistant\n’, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=0.1, top_k=-1, min_p=0.0, seed=None, stop=, stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.

For deterministic OCR extraction, set temperature=0, top_p=1.0, and top_k=0 (or -1). This ensures greedy decoding, which is the most repeatable and least random. Your current settings (temperature=0.1, top_p=0.1) introduce randomness and can cause inconsistent outputs (SamplingParams docs).
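
A sketch of deterministic settings, assuming you construct SamplingParams directly via the offline LLM API (through the OpenAI-compatible server, sending temperature=0 in the request has the same greedy effect):

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0,  # greedy decoding, fully deterministic token choice
    top_p=1.0,
    top_k=-1,       # disable top-k filtering
    max_tokens=512,
)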

To prevent infinite repetition of the same token, vLLM does not natively support a “no_repeat_ngram_size” parameter, but you can implement a custom logits processor for this purpose. See this example for how to add a NoRepeatNGramLogitsProcessor to your sampling parameters.
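
As a rough sketch of such a processor (assuming the offline LLM API and a vLLM version whose SamplingParams still accepts per-request logits_processors; the OpenAI-compatible server does not expose arbitrary logits processors, and max_repeats=10 is an arbitrary choice):

from vllm import SamplingParams

def make_repeat_blocker(max_repeats: int = 10):
    # vLLM accepts callables of the form (generated_token_ids, logits) -> logits
    def blocker(token_ids, logits):
        if len(token_ids) >= max_repeats and len(set(token_ids[-max_repeats:])) == 1:
            # The same token was emitted max_repeats times in a row: forbid one more copy
            logits[token_ids[-1]] = float("-inf")
        return logits
    return blocker

sampling_params = SamplingParams(temperature=0, max_tokens=512,
                                 logits_processors=[make_repeat_blocker()])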

Would you like a code example for adding a custom logits processor?

Is it possible to remove longer tokens, or make the model not prefer them?

For instance, I see that in my log probs I have

TopLogprob(token=‘ise’, bytes=[105, 115, 101], logprob=-0.025667304173111916)
TopLogprob(token=‘í’, bytes=[195, 173], logprob=-4.150667190551758)
TopLogprob(token=‘isé’, bytes=[105, 115, 195, 169], logprob=-5.775667190551758)
TopLogprob(token=‘i’, bytes=[105], logprob=-6.025667190551758)
TopLogprob(token=‘ï’, bytes=[195, 175], logprob=-6.650667190551758)

The correct one was the second one, so is it possible to exclude the token ‘ise’ when I predict using vLLM?

Yes, you can exclude specific tokens or words (like “ise”) during generation in vLLM by using the bad_words parameter in SamplingParams. This will prevent the model from generating those tokens (SamplingParams docs). For example:

from vllm import SamplingParams
sampling_params = SamplingParams(bad_words=["ise"])

Would you like more detail on how to use bad_words for multi-token or subword exclusions?

Will vLLM work on a machine with CUDA 12.2 (driver 5.3.5) and 6 Tesla T4 GPUs?

vllm serve Qwen/Qwen2.5-VL-7B-Instruct --tensor-parallel-size 4 --enable-chunked-prefill --enable-prefix-caching --host 0.0.0.0 --port 6161 --dtype half

vLLM requires a CUDA GPU with compute capability 7.0 or higher (Tesla T4 is 7.5), and is compatible with CUDA 12.x. However, the official pre-built vLLM wheels are compiled with CUDA 12.6 by default; for CUDA 12.2, you may need to build vLLM from source for full compatibility (installation docs). Your driver version (5.3.5) is not standard NVIDIA notation—please confirm it’s at least 470+.

In summary: vLLM will work with 6×T4 GPUs and CUDA 12.2 if you build from source or use a compatible wheel. All your command flags are supported, but ensure you use dtype=half (T4 does not support bfloat16).
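
A rough sketch of the source build (assuming a working CUDA 12.2 toolchain and a PyTorch build that matches it; the authoritative steps are in the vLLM installation docs):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # compiles the CUDA kernels against your local toolkit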

Would you like step-by-step instructions for building vLLM from source for CUDA 12.2?

Qwen/Qwen2.5-VL-7B-Instruct has 28 heads I believe, so --tensor-parallel-size 6 doesn’t work; it gives me an error that the total number of attention heads must be divisible by the tensor parallel size. Is there a workaround if I want to use 6 GPUs? I don’t have 7 GPUs.

No, there is currently no official workaround: vLLM requires the total number of attention heads (28 for Qwen2.5-VL-7B-Instruct) to be divisible by --tensor-parallel-size, so 6 is not allowed (issue, issue). The only supported options are tensor parallel sizes of 1, 2, or 4 for this model.

As a workaround, you can use pipeline parallelism: set --tensor-parallel-size 1 --pipeline-parallel-size 6, which splits the model by layers instead of heads (docs). Note that pipeline parallelism may have higher latency and is less efficient for single requests.
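
For example, reusing your other flags (a TP=2 / PP=3 split may also be worth trying on 6 GPUs, since 28 heads divide evenly by 2):

vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
 --tensor-parallel-size 1 --pipeline-parallel-size 6 \
 --enable-chunked-prefill --enable-prefix-caching \
 --host 0.0.0.0 --port 6161 --dtype half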

Would you like more detail on pipeline parallelism or alternative deployment strategies?

What quantization will work on V100 (Volta, CUDA compute 7.0) and T4 (Turing, CUDA compute 7.5)?

On V100 (compute 7.0), only GPTQ, BitBLAS (GPTQ), AQLM, bitsandbytes, DeepSpeedFP, and GGUF quantization methods are supported. On T4 (compute 7.5), you can use AWQ, GPTQ, Marlin (GPTQ/AWQ/FP8), INT8 (W8A8), BitBLAS (GPTQ), AQLM, bitsandbytes, DeepSpeedFP, GGUF, and modelopt. FP8 (W8A8) is not supported on either V100 or T4 (supported hardware table).
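
For instance, a GPTQ checkpoint on a T4 would be served roughly like this (the model path is a placeholder for whichever GPTQ-quantized Qwen2.5-VL checkpoint you produce or find; vLLM normally infers the quantization method from the checkpoint config, so --quantization is just being explicit):

vllm serve <your-gptq-quantized-qwen2.5-vl-7b> \
 --quantization gptq --dtype half \
 --host 0.0.0.0 --port 6161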

Would you like more detail on which quantization method is best for your use case?

Among them, which is the most accurate and fastest?

Among quantization methods supported on V100 and T4, GPTQ and BitBLAS (GPTQ) generally offer the best balance of accuracy and speed, with BitBLAS providing more precision options and sometimes higher throughput (BitBLAS docs). AWQ is also fast on T4 but may have slightly lower accuracy than GPTQ, especially for complex tasks (supported hardware).

For maximum accuracy, use INT8 (W8A8) on T4, but for best speed-accuracy tradeoff, GPTQ or BitBLAS (GPTQ) are recommended.

Would you like benchmark data or configuration examples?
