Running the qwen3-vl-30b-a3b model with vLLM 0.11.0, my stress-test results show that reducing the number of input tokens does not always reduce latency.
The model is deployed on a single A800 GPU. The startup command is:
```shell
vllm serve qwen3-vl-30b-a3b \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt.video 0
```
I performed a stress test using one image and a set of text prompts, with QPS set to 10.
I resized the image to 0.25x and 0.7x of the original size while keeping everything else unchanged.
The results are as follows:

- qwen3-30b-a3b (single image ×0.25): latency ≈ 3 s
- qwen3-30b-a3b (single image ×0.7): latency ≈ 5 s
- qwen3-30b-a3b (single image, full size): latency ≈ 5 s
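For reference, a fixed-QPS stress test like the one above can be reproduced with an open-loop load generator. This is a minimal sketch; `send_request` is a hypothetical stub standing in for the actual chat-completion call (the real client, endpoint, and payload are not shown here):

```python
import threading
import time

def run_load_test(send_request, qps=10, duration_s=5):
    """Open-loop load generator: fire requests at a fixed rate and
    record end-to-end latency, regardless of how long each one takes."""
    latencies = []
    lock = threading.Lock()
    threads = []

    def worker():
        start = time.perf_counter()
        send_request()  # hypothetical: one image + text chat-completion call
        elapsed = time.perf_counter() - start
        with lock:
            latencies.append(elapsed)

    interval = 1.0 / qps
    for _ in range(int(qps * duration_s)):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
        time.sleep(interval)  # keep the arrival rate fixed

    for t in threads:
        t.join()
    return latencies

if __name__ == "__main__":
    # Usage with a stub that simulates a 50 ms request:
    lats = run_load_test(lambda: time.sleep(0.05), qps=10, duration_s=1)
    print(len(lats))
```

Open-loop generation matters here: because new requests keep arriving at the target QPS even while earlier ones are still in flight, it exposes queueing effects that a closed-loop (one-at-a-time) benchmark would hide.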
Prior conditions (approximate token counts):

| Input | Input tokens | Output tokens |
|---|---|---|
| Single image + text prompts | ~4200 | ~70 |
| Single image ×0.6 + text prompts | ~1900 | ~70 |
| Single image ×0.3 + text prompts | ~860 | ~70 |
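The roughly quadratic drop in input tokens as the image is scaled down is consistent with patch-based vision tokenization. A back-of-the-envelope sketch, assuming a Qwen-VL-style effective patch size of 28 px after spatial merging (the exact patch size, smart-resize rounding, and the 1680 px example dimension are assumptions for illustration, not values taken from the report above):

```python
import math

def estimate_vision_tokens(width, height, patch_px=28):
    """Rough vision-token estimate for a patch-based encoder:
    one token per (patch_px x patch_px) region after merging."""
    return math.ceil(width / patch_px) * math.ceil(height / patch_px)

# e.g. a hypothetical 1680x1680 image at full, 0.6x, and 0.3x scale:
for scale in (1.0, 0.6, 0.3):
    side = int(1680 * scale)
    print(f"{scale}x -> ~{estimate_vision_tokens(side, side)} vision tokens")
```

The printed numbers are illustrative rather than a fit to the measured counts (which also include text-prompt tokens), but they show the key point: halving each image dimension cuts vision tokens roughly fourfold.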
Decreasing the input token count by resizing the image does not always reduce latency in vLLM with Qwen3-VL-30B-A3B, as your tests show. This is known behavior and can stem from fixed overheads in multimodal preprocessing, vision encoder execution, or batching/scheduling constraints, none of which are proportional to input token length. In your case, only the smallest image (0.25×) produced a noticeable latency drop, while the 0.7× and full-size images had similar latencies despite different token counts, indicating that latency scales non-linearly with input size for this model and setup.
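One way to sanity-check the batching/queueing hypothesis is Little's law: at steady state, in-flight requests = arrival rate × latency. At QPS 10 and ~5 s latency the server is holding roughly 50 concurrent requests; if the effective prefill capacity of a single A800 for this model is lower than that, queue wait dominates end-to-end latency, and shrinking the image only helps once the prompt is small enough to relieve the queue. A quick arithmetic sketch (the capacity interpretation is a hypothesis, not a measured figure):

```python
def inflight(qps, latency_s):
    """Little's law: L = lambda * W (average concurrent requests)."""
    return qps * latency_s

# Approximate concurrency implied by the observed latencies at QPS 10:
for label, lat in [("0.25x image", 3.0), ("0.7x image", 5.0), ("full image", 5.0)]:
    print(f"{label}: ~{inflight(10, lat):.0f} requests in flight")
```

If this is the bottleneck, server-side metrics (e.g. queue time vs. inference time in vLLM's Prometheus endpoint) should show queue wait shrinking only for the 0.25× case.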
This phenomenon has been reported by other users, and no documented parameter or setting guarantees linear latency reduction with fewer input tokens for Qwen3-VL models in vLLM 0.11.0. The behavior likely relates to how the vision encoder and multimodal pipeline are implemented, so profiling and tuning may be needed to reach optimal performance in such scenarios. For more details, see the discussion at vLLM GitHub Issue #30941.