Unexpected performance gap when reproducing PR #25337 benchmark results on A100 PCIe using Qwen3-VL-30B-A3B-FP8

Background

Recently, we have been working on optimizing the position computation for multimodal models in vLLM.

During benchmarking, we noticed that our results were not as expected.

To investigate, we reproduced the benchmark from PR #25337, comparing performance at the commits immediately before and after that PR was merged into the main branch (a build sketch follows the commit list).

  • Before PR commit: cf56cf78b47e5f9b6a81ce0d50a94f9291922315

  • After PR commit: 30d08911f7cf78287f8da003ddcc99f6ef196f9f
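
For reference, here is roughly how we checked out and built vLLM at each commit. This is a sketch assuming a source checkout with a working build environment; VLLM_USE_PRECOMPILED=1 reuses prebuilt kernels and is only appropriate if the two commits differ purely in Python code (drop it to do a full build).

git clone https://github.com/vllm-project/vllm.git && cd vllm
# Before PR #25337
git checkout cf56cf78b47e5f9b6a81ce0d50a94f9291922315
VLLM_USE_PRECOMPILED=1 pip install -e .
# ...serve and benchmark, then repeat with the after-PR commit:
# git checkout 30d08911f7cf78287f8da003ddcc99f6ef196f9f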

However, our reproduced results differ significantly from the performance data reported in the PR.

We’d like to understand whether this discrepancy may be caused by hardware differences, model choice, or benchmark setup.

Could anyone help guide us?

Model and Environment

  • Model used: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
    (The Qwen3-VL-4B model used in the PR could not be found on Hugging Face.)

  • GPU: NVIDIA A100 PCIe

  • vLLM startup command:
vllm serve "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384

Benchmark Command

vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
  --base-url "http://localhost:8000" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "hf" \
  --dataset-path "lmarena-ai/VisionArena-Chat" \
  --num-prompts 100 \
  --request-rate 10 \
  --save-result \
  --result-dir benchmarks_results \
  --result-filename test.json
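
Since --save-result writes a JSON file per run, the two runs can also be compared mechanically rather than by eyeballing the console tables. A minimal sketch with jq, where before.json / after.json stand in for the two --result-filename values, and the metric keys (e.g. mean_ttft_ms) are our assumption about what vllm bench serve saves; check your files and adjust:

for f in benchmarks_results/before.json benchmarks_results/after.json; do
  echo "== $f"
  # Metric keys assumed; inspect the JSON with `jq keys` if these don't match
  jq '{request_throughput, output_throughput, mean_ttft_ms, mean_tpot_ms, mean_itl_ms}' "$f"
done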

Benchmark Results

Before PR #25337

============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.91
Total input tokens:                      5280
Total generated tokens:                  11522
Request throughput (req/s):              5.91
Output token throughput (tok/s):         681.42
Peak output token throughput (tok/s):    2225.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          993.68
---------------Time to First Token----------------
Mean TTFT (ms):                          1176.13
Median TTFT (ms):                        1185.79
P99 TTFT (ms):                           2178.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.39
Median TPOT (ms):                        78.68
P99 TPOT (ms):                           392.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           77.30
Median ITL (ms):                         42.31
P99 ITL (ms):                            581.15
==================================================

After PR #25337

============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.89
Total input tokens:                      5280
Total generated tokens:                  11640
Request throughput (req/s):              5.92
Output token throughput (tok/s):         689.02
Peak output token throughput (tok/s):    2178.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          1001.57
---------------Time to First Token----------------
Mean TTFT (ms):                          1193.52
Median TTFT (ms):                        1285.23
P99 TTFT (ms):                           2111.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.84
Median TPOT (ms):                        78.00
P99 TPOT (ms):                           344.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.89
Median ITL (ms):                         42.30
P99 ITL (ms):                            597.42
==================================================

Reference: Benchmark Results from PR #25337

Main branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================

This branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================

Question

The results we obtained are noticeably different from the benchmark numbers shown in PR #25337.

Could this gap be explained by differences such as:

  • Model (Qwen3-VL-4B vs. Qwen3-VL-30B-A3B-Instruct-FP8)

  • Hardware (A100 PCIe vs. SXM)

  • Dataset or benchmarking parameters

Has anyone else tried reproducing this PR and observed similar discrepancies?


🙏 Thanks in advance for any help or clarification!



It seems likely that the gap you observed relative to PR #25337 comes from a combination of factors:

  • Model size: Qwen3-VL-4B (used in the PR) is much smaller than Qwen3-VL-30B-A3B-Instruct-FP8, which by itself would explain much of the lower throughput and higher latency you measured.

  • Hardware: the PR may have been run on an A100 SXM or a different GPU entirely; an A100 PCIe has a lower power limit than an SXM part, which reduces sustained throughput.

  • Prompt count and concurrency: the PR's benchmarks used 1000 prompts, while your runs used only 100. Short runs with few prompts are dominated by warmup and never reach steady-state batching, which skews both throughput and latency.

The dataset and remaining benchmark parameters otherwise appear consistent with the PR setup: both use the VisionArena-Chat dataset and similar vLLM commands. Per PR #25337, both the main branch and the PR branch were tested with 1000 prompts, and the reported TTFT/TPOT/ITL numbers are much better than what you see at 100 prompts.

My understanding is that to reproduce the PR's results more closely, you should (1) use the same model (Qwen3-VL-4B, if it becomes available), (2) match the hardware as closely as possible (A100 SXM if that was used), and (3) increase the number of prompts to 1000 and allow enough concurrency to saturate the server, as sketched below. Also make sure each run lasts long enough for the server to reach steady state; short runs with few prompts underutilize the model and distort latency and throughput metrics. Others have likewise noted that model size, hardware, prompt count, and concurrency all have a large impact on vLLM benchmark comparisons (PR #25337). Would you like a step-by-step breakdown of how to set up a more comparable benchmark?
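
A sketch of that adjusted run, assuming --max-concurrency and --seed are available in your vLLM version (the 128 cap is an illustrative value near the PR's observed peak concurrency of ~133):

# Same benchmark, but matching the PR's prompt count and pinning
# concurrency and the dataset sampling seed for comparability
vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
  --base-url "http://localhost:8000" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "hf" \
  --dataset-path "lmarena-ai/VisionArena-Chat" \
  --num-prompts 1000 \
  --request-rate 10 \
  --max-concurrency 128 \
  --seed 0 \
  --save-result \
  --result-dir benchmarks_results \
  --result-filename repro_1000.json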
