Background
Recently, we have been working on optimizing the position computation for multimodal models in vLLM.
During benchmarking, we noticed that our results were not as expected.
To investigate, we decided to reproduce the benchmark results from PR #25337, comparing the performance before and after that PR was merged into the main branch.
- Before PR commit: cf56cf78b47e5f9b6a81ce0d50a94f9291922315
- After PR commit: 30d08911f7cf78287f8da003ddcc99f6ef196f9f
However, our reproduced results differ significantly from the performance data reported in the PR.
We'd like to understand whether this discrepancy is caused by hardware differences, model choice, or benchmark setup. Any guidance would be appreciated.
Model and Environment
- Model used: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 (the Qwen3-VL-4B model used in the PR could not be found on Hugging Face)
- GPU: NVIDIA A100 PCIe
- vLLM startup command:
vllm serve "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--max-model-len 16384
Benchmark Command
vllm bench serve \
--backend openai-chat \
--model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
--base-url "http://localhost:8000" \
--endpoint "/v1/chat/completions" \
--dataset-name "hf" \
--dataset-path "lmarena-ai/VisionArena-Chat" \
--num-prompts 100 \
--request-rate 10 \
--save-result \
--result-dir benchmarks_results \
--result-filename test.json
Benchmark Results
Before PR #25337
============ Serving Benchmark Result ============
Successful requests: 100
Request rate configured (RPS): 10.00
Benchmark duration (s): 16.91
Total input tokens: 5280
Total generated tokens: 11522
Request throughput (req/s): 5.91
Output token throughput (tok/s): 681.42
Peak output token throughput (tok/s): 2225.00
Peak concurrent requests: 97.00
Total Token throughput (tok/s): 993.68
---------------Time to First Token----------------
Mean TTFT (ms): 1176.13
Median TTFT (ms): 1185.79
P99 TTFT (ms): 2178.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 88.39
Median TPOT (ms): 78.68
P99 TPOT (ms): 392.01
---------------Inter-token Latency----------------
Mean ITL (ms): 77.30
Median ITL (ms): 42.31
P99 ITL (ms): 581.15
==================================================
After PR #25337
============ Serving Benchmark Result ============
Successful requests: 100
Request rate configured (RPS): 10.00
Benchmark duration (s): 16.89
Total input tokens: 5280
Total generated tokens: 11640
Request throughput (req/s): 5.92
Output token throughput (tok/s): 689.02
Peak output token throughput (tok/s): 2178.00
Peak concurrent requests: 97.00
Total Token throughput (tok/s): 1001.57
---------------Time to First Token----------------
Mean TTFT (ms): 1193.52
Median TTFT (ms): 1285.23
P99 TTFT (ms): 2111.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 88.84
Median TPOT (ms): 78.00
P99 TPOT (ms): 344.25
---------------Inter-token Latency----------------
Mean ITL (ms): 76.89
Median ITL (ms): 42.30
P99 ITL (ms): 597.42
==================================================
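As a quick sanity check on our reproduction (a minimal sketch; all numbers are copied from the two result blocks above), the reported output-token throughputs are internally consistent with total generated tokens divided by benchmark duration:

```python
# Verify that the reported output token throughput matches
# total generated tokens / benchmark duration for our two runs.

def check(total_tokens, duration_s, reported_tok_s, tol=0.5):
    computed = total_tokens / duration_s
    assert abs(computed - reported_tok_s) < tol, (computed, reported_tok_s)
    return computed

# Before PR #25337: 11522 generated tokens over 16.91 s
before = check(11522, 16.91, 681.42)  # ≈ 681.4 tok/s

# After PR #25337: 11640 generated tokens over 16.89 s
after = check(11640, 16.89, 689.02)   # ≈ 689.2 tok/s

print(f"before: {before:.1f} tok/s, after: {after:.1f} tok/s")
```

So the measurements themselves look internally consistent; the before/after difference on our setup is only about 1%.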
Reference: Benchmark Results from PR #25337
Main branch
============ Serving Benchmark Result ============
Successful requests: 1000
Request rate configured (RPS): 10.00
Benchmark duration (s): 101.85
Total input tokens: 94327
Total generated tokens: 120882
Request throughput (req/s): 9.82
Output token throughput (tok/s): 1186.81
Peak output token throughput (tok/s): 2862.00
Peak concurrent requests: 133.00
Total Token throughput (tok/s): 2112.91
---------------Time to First Token----------------
Mean TTFT (ms): 229.53
Median TTFT (ms): 180.19
P99 TTFT (ms): 928.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 40.65
Median TPOT (ms): 36.29
P99 TPOT (ms): 87.93
---------------Inter-token Latency----------------
Mean ITL (ms): 39.96
Median ITL (ms): 17.36
P99 ITL (ms): 186.27
==================================================
This branch
============ Serving Benchmark Result ============
Successful requests: 1000
Request rate configured (RPS): 10.00
Benchmark duration (s): 101.66
Total input tokens: 94327
Total generated tokens: 120735
Request throughput (req/s): 9.84
Output token throughput (tok/s): 1187.67
Peak output token throughput (tok/s): 2310.00
Peak concurrent requests: 124.00
Total Token throughput (tok/s): 2115.57
---------------Time to First Token----------------
Mean TTFT (ms): 203.78
Median TTFT (ms): 162.26
P99 TTFT (ms): 848.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.27
Median TPOT (ms): 31.53
P99 TPOT (ms): 80.10
---------------Inter-token Latency----------------
Mean ITL (ms): 36.00
Median ITL (ms): 16.07
P99 ITL (ms): 170.49
==================================================
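One thing the numbers above already suggest (a quick sketch using only figures copied from the result blocks; the diagnosis is our assumption, not something confirmed in the PR): the per-request workload in our run differs noticeably from the PR's reference run, and our achieved request rate falls well short of the configured 10 RPS, so the two setups may not be directly comparable even before considering hardware:

```python
# Compare per-request workload and achieved request rate between our
# reproduction and the PR's reference run (main branch), using the
# figures reported in the result blocks above.

runs = {
    "ours (before PR)": dict(n=100, input_tok=5280, output_tok=11522, dur=16.91),
    "PR main branch":   dict(n=1000, input_tok=94327, output_tok=120882, dur=101.85),
}

for name, r in runs.items():
    print(f"{name}: "
          f"{r['input_tok'] / r['n']:.1f} input tok/req, "
          f"{r['output_tok'] / r['n']:.1f} output tok/req, "
          f"{r['n'] / r['dur']:.2f} req/s achieved")
```

Our run averages roughly 52.8 input tokens per request versus about 94.3 in the PR's run, which points to different prompts being sampled from VisionArena-Chat (likely due to `--num-prompts 100` vs. 1000), and our achieved ~5.9 req/s against a configured 10 RPS suggests the server was saturated on our hardware.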
Question
The results we obtained are noticeably different from the benchmark numbers shown in PR #25337.
Could this gap be explained by differences such as:
- Model (Qwen3-VL-4B vs. Qwen3-VL-30B-A3B-Instruct-FP8)
- Hardware (A100 PCIe vs. SXM)
- Dataset or benchmarking parameters
Has anyone else tried reproducing this PR and observed similar discrepancies?
Thanks in advance for any help or clarification!