Running the qwen3-vl-30b-a3b model with vLLM 0.11.0, my stress-test results show that reducing the number of input tokens does not always reduce latency.
The model is deployed on a single A800 GPU. The startup command is:
```shell
vllm serve qwen3-vl-30b-a3b \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt.video 0
```
I performed a stress test using one image and a set of text prompts, with QPS set to 10.
I resized the image to 0.25x and 0.7x of the original size while keeping everything else unchanged.
The results are as follows:

- qwen3-30b-a3b (single image ×0.25): latency ≈ 3 s
- qwen3-30b-a3b (single image ×0.7): latency ≈ 5 s
- qwen3-30b-a3b (single image, full size): latency ≈ 5 s
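For reference, a fixed-QPS stress test like the one above can be reproduced with an open-loop load generator. This is a minimal sketch; `send_request` is a hypothetical stub standing in for the actual chat-completion call (the real client, endpoint, and payload are not shown here):

```python
import threading
import time

def run_load_test(send_request, qps=10, duration_s=5):
    """Open-loop load generator: fire requests at a fixed rate and
    record end-to-end latency, regardless of how long each one takes."""
    latencies = []
    lock = threading.Lock()
    threads = []

    def worker():
        start = time.perf_counter()
        send_request()  # hypothetical: one image + text chat-completion call
        elapsed = time.perf_counter() - start
        with lock:
            latencies.append(elapsed)

    interval = 1.0 / qps
    for _ in range(int(qps * duration_s)):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
        time.sleep(interval)  # keep the arrival rate fixed

    for t in threads:
        t.join()
    return latencies

if __name__ == "__main__":
    # Usage with a stub that simulates a 50 ms request:
    lats = run_load_test(lambda: time.sleep(0.05), qps=10, duration_s=1)
    print(len(lats))
```

Open-loop generation matters here: because new requests keep arriving at the target QPS even while earlier ones are still in flight, it exposes queueing effects that a closed-loop (one-at-a-time) benchmark would hide.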
Prior conditions (approximate token counts):

| Input | Input tokens | Output tokens |
|---|---|---|
| Single image + text prompts | ~4200 | ~70 |
| Single image ×0.6 + text prompts | ~1900 | ~70 |
| Single image ×0.3 + text prompts | ~860 | ~70 |
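The roughly quadratic drop in input tokens as the image is scaled down is consistent with patch-based vision tokenization. A back-of-the-envelope sketch, assuming a Qwen-VL-style effective patch size of 28 px after spatial merging (the exact patch size, smart-resize rounding, and the 1680 px example dimension are assumptions for illustration, not values taken from the report above):

```python
import math

def estimate_vision_tokens(width, height, patch_px=28):
    """Rough vision-token estimate for a patch-based encoder:
    one token per (patch_px x patch_px) region after merging."""
    return math.ceil(width / patch_px) * math.ceil(height / patch_px)

# e.g. a hypothetical 1680x1680 image at full, 0.6x, and 0.3x scale:
for scale in (1.0, 0.6, 0.3):
    side = int(1680 * scale)
    print(f"{scale}x -> ~{estimate_vision_tokens(side, side)} vision tokens")
```

The printed numbers are illustrative rather than a fit to the measured counts (which also include text-prompt tokens), but they show the key point: halving each image dimension cuts vision tokens roughly fourfold.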
Decreasing the input token count by resizing the image does not always reduce latency in vLLM with Qwen3-VL-30B-A3B, as your tests show. This is known behavior and can stem from fixed overheads in multimodal preprocessing, vision encoder execution, or batching/scheduling constraints, none of which are proportional to input token length. In your case, only the smallest image (0.25×) produced a noticeable latency drop, while the 0.7× and full-size images had similar latencies despite different token counts, indicating that latency scales non-linearly with input size for this model and setup.
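One way to sanity-check the batching/queueing hypothesis is Little's law: at steady state, in-flight requests = arrival rate × latency. At QPS 10 and ~5 s latency the server is holding roughly 50 concurrent requests; if the effective prefill capacity of a single A800 for this model is lower than that, queue wait dominates end-to-end latency, and shrinking the image only helps once the prompt is small enough to relieve the queue. A quick arithmetic sketch (the capacity interpretation is a hypothesis, not a measured figure):

```python
def inflight(qps, latency_s):
    """Little's law: L = lambda * W (average concurrent requests)."""
    return qps * latency_s

# Approximate concurrency implied by the observed latencies at QPS 10:
for label, lat in [("0.25x image", 3.0), ("0.7x image", 5.0), ("full image", 5.0)]:
    print(f"{label}: ~{inflight(10, lat):.0f} requests in flight")
```

If this is the bottleneck, server-side metrics (e.g. queue time vs. inference time in vLLM's Prometheus endpoint) should show queue wait shrinking only for the 0.25× case.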
This phenomenon has been reported by other users, and no documented parameter or setting guarantees linear latency reduction with fewer input tokens for Qwen3-VL models in vLLM 0.11.0. The behavior likely relates to how the vision encoder and multimodal pipeline are implemented, so profiling and tuning may be needed to reach optimal performance in such scenarios. For more details, see the discussion at vLLM GitHub Issue #30941.