I have a single NVIDIA H200 GPU (141 GB). I serve the model with the following command:

vllm serve Qwen/Qwen3-VL-4B-Instruct-FP8 --limit-mm-per-prompt.video 0 --quantization fp8 --kv-cache-dtype fp8_e4m3 --max-model-len 2048 --dtype bfloat16 --max-num-batched-tokens 16384 --port 80 --download_dir /model --gpu-memory-utilization 0.9 --no-enforce-eager --trust-remote-code --enable-chunked-prefill

but the benchmark reports these results:
============ Serving Benchmark Result ============
Successful requests: 6177
Request rate configured (RPS): 48.00
Benchmark duration (s): 416.82
Total input tokens: 287562
Total generated tokens: 754172
Request throughput (req/s): 14.82
Output token throughput (tok/s): 1809.37
Peak output token throughput (tok/s): 6109.00
Peak concurrent requests: 3751.00
Total Token throughput (tok/s): 2499.27
---------------Time to First Token----------------
Mean TTFT (ms): 121584.46
Median TTFT (ms): 160674.25
P99 TTFT (ms): 235113.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 79.94
Median TPOT (ms): 76.32
P99 TPOT (ms): 283.87
---------------Inter-token Latency----------------
Mean ITL (ms): 133.72
Median ITL (ms): 67.87
P99 ITL (ms): 1233.26
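For reference, here is how I read the numbers (just arithmetic on the report above, nothing vLLM-specific): the benchmark offered 48 req/s, but the server only sustained about 14.8 req/s, so requests pile up in a queue, which would explain the huge TTFT.

```python
# Sanity check using only the numbers printed in the benchmark report above.
num_requests = 6177
configured_rps = 48.0   # request rate the benchmark tried to send
duration_s = 416.82     # total benchmark duration

served_rps = num_requests / duration_s            # ~14.82 req/s actually served
arrival_window_s = num_requests / configured_rps  # when the last request arrives

print(f"served throughput ~ {served_rps:.2f} req/s (vs. {configured_rps:.0f} offered)")
print(f"all requests sent by t ~ {arrival_window_s:.1f} s, "
      f"but serving them took {duration_s:.1f} s")
```

So all 6177 requests arrive within about 129 s, but draining them takes about 417 s, which matches a mean TTFT of roughly 122 s of queuing delay.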
What am I doing wrong? A mean TTFT of ~122 s (median ~161 s) seems far too high for a 4B model on an H200.