Is Qwen3-VL-4B-FP8 on an H200 performing poorly?

I have a single H200 (141 GB) GPU and serve the model with the following parameters:

vllm serve Qwen/Qwen3-VL-4B-Instruct-FP8 --limit-mm-per-prompt.video 0 --quantization fp8 --kv-cache-dtype fp8_e4m3 --max-model-len 2048 --dtype bfloat16 --max-num-batched-tokens 16384 --port 80 --download_dir /model --gpu-memory-utilization 0.9 --no-enforce-eager --trust-remote-code --enable-chunked-prefill

but the benchmark reports the following results:

============ Serving Benchmark Result ============
Successful requests: 6177
Request rate configured (RPS): 48.00
Benchmark duration (s): 416.82
Total input tokens: 287562
Total generated tokens: 754172
Request throughput (req/s): 14.82
Output token throughput (tok/s): 1809.37
Peak output token throughput (tok/s): 6109.00
Peak concurrent requests: 3751.00
Total Token throughput (tok/s): 2499.27
---------------Time to First Token----------------
Mean TTFT (ms): 121584.46
Median TTFT (ms): 160674.25
P99 TTFT (ms): 235113.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 79.94
Median TPOT (ms): 76.32
P99 TPOT (ms): 283.87
---------------Inter-token Latency----------------
Mean ITL (ms): 133.72
Median ITL (ms): 67.87
P99 ITL (ms): 1233.26

What am I doing wrong?

Your high Time to First Token (TTFT) is most likely a queueing effect rather than slow inference: the benchmark sends requests at 48 RPS while the server sustains only about 14.8 req/s, so the queue grows for the entire run and mean TTFT climbs past two minutes of pure waiting time. A few configuration points also stand out. Passing --quantization fp8 is redundant for a checkpoint that is already FP8-quantized, since vLLM picks up the quantization from the model config; --kv-cache-dtype fp8_e4m3 is a separate setting that quantizes the KV cache rather than the weights and is fine to keep if accuracy holds up. With --max-model-len 2048 the context is very short for Qwen3-VL-4B-Instruct-FP8, which supports much longer contexts (e.g., 8192), and chunked prefill mainly pays off for long prompts, so --enable-chunked-prefill adds little at 2048 tokens. Try lowering --max-num-batched-tokens (e.g., 4096), reducing the benchmark request rate to at or below the measured throughput, increasing --max-model-len if memory allows, and dropping the redundant --quantization flag.
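As a minimal sketch of what that could look like (the values 8192 and 4096 are illustrative starting points rather than tuned numbers; every flag name is taken from your original command):

vllm serve Qwen/Qwen3-VL-4B-Instruct-FP8 \
  --limit-mm-per-prompt.video 0 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.9 \
  --download_dir /model \
  --trust-remote-code \
  --enable-chunked-prefill \
  --port 80

Compared with your command, this drops --quantization fp8 (redundant for a pre-quantized FP8 checkpoint) and omits --dtype bfloat16 and --no-enforce-eager, which match the defaults for this model.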

For reference, see the Qwen2.5-VL usage guide, which recommends tuning --max-model-len and --max-num-batched-tokens to balance throughput and latency, and notes that large batch sizes can increase TTFT under heavy load; the same guidance applies to Qwen3-VL. Would you like a step-by-step breakdown of these settings?
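To confirm that the long TTFT is queueing rather than slow inference, you could rerun the benchmark with the offered load capped below the measured ~14.8 req/s capacity; TTFT should drop sharply. A sketch, assuming you are using vLLM's bundled serving benchmark (vllm bench serve, or the equivalent benchmarks/benchmark_serving.py script in older releases) with a random dataset sized roughly like your original traffic (about 47 input and 122 output tokens per request):

vllm bench serve \
  --model Qwen/Qwen3-VL-4B-Instruct-FP8 \
  --port 80 \
  --dataset-name random \
  --random-input-len 50 \
  --random-output-len 120 \
  --request-rate 12 \
  --num-prompts 2000

If TTFT stays low at 12 RPS but explodes again at 48 RPS, the bottleneck is the arrival rate exceeding the server's capacity, not the serving configuration itself.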
