Help needed with vLLM Qwen3-32B inference performance tuning (long input slow)

Hi everyone,

I’m currently running Qwen3-32B using vLLM 0.8.4 (v1) for long-context inference, and I’m experiencing significant performance degradation when handling long inputs (~2K+ tokens), even though the total token count (input + output) is roughly similar across samples.

Here are my current setup details:

  • Model: Qwen3-32B
  • Framework: vLLM 0.8.4 (v1)
  • Tensor Parallelism: tp=2
  • Max output length: ~28K tokens (average 10K+)
  • Input length:
    • Short: ~100 tokens
    • Long: ~2K+ tokens
  • Hardware: 2xH800 80G
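
For reference, my launch is roughly equivalent to the sketch below (simplified from my actual script, so the model path, max_model_len, and sampling settings are approximations):

```python
from vllm import LLM, SamplingParams

# Rough reproduction of the setup above: Qwen3-32B on 2xH800 with tp=2.
# Values here are approximate placeholders, not the exact production config.
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=2,       # tp=2 across the two H800s
    max_model_len=32768,          # must cover ~2K input + up to ~28K output
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,              # illustrative; actual sampling differs
    max_tokens=28000,             # long outputs, ~10K tokens on average
)

prompts = ["<short ~100-token prompt>", "<long ~2K-token prompt>"]
outputs = llm.generate(prompts, sampling_params)
```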

Observations

When processing short input prompts, the speed is quite decent:

Processed prompts:  55%|█████▍    | 1161/2112 [3:57:50<3:44:08, 14.14s/it, est. speed input: 6.40 toks/s, output: 831.10 toks/s]
Processed prompts:  55%|█████▌    | 1162/2112 [3:58:16<4:43:07, 17.88s/it, est. speed input: 6.39 toks/s, output: 829.93 toks/s]

However, when dealing with longer input prompts (~2K+ tokens), the throughput drops significantly:

Processed prompts:  14%|█▍        | 420/2952 [4:05:20<16:20:13, 23.23s/it, est. speed input: 59.82 toks/s, output: 509.59 toks/s]
Processed prompts:  14%|█▍        | 421/2952 [4:06:30<26:13:55, 37.31s/it, est. speed input: 59.68 toks/s, output: 508.23 toks/s]

The total number of tokens processed per request (input + output) is roughly balanced between short and long input cases, yet the long-input scenario is much slower.

Questions

I’m wondering if anyone has experienced similar issues or could offer suggestions on how to optimize vLLM’s performance for both long input and output sequences.

Thanks in advance!

Performance degradation with long input prompts in vLLM is a known issue. The prefill (input) phase is compute-bound, and its cost grows with input length: even when the total token count (input + output) is similar, longer inputs require more compute for attention over the prompt and for writing the KV cache, which lowers overall throughput.

Chunked prefill is enabled by default in vLLM v1 to mitigate this, but tuning max_num_batched_tokens can help further: higher values improve throughput for long prompts, while lower values reduce latency for shorter ones. For large models and long contexts, setting max_num_batched_tokens above 8096 is recommended for throughput, especially on large GPUs such as the H800. Also make sure you are not hitting memory or KV cache bottlenecks, and consider increasing tensor parallelism if possible, keeping the added synchronization overhead in mind. See "Performance Tuning with Chunked Prefill" and "Optimization and Tuning" in the vLLM docs for more details.
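To make that concrete, here is a minimal sketch of how those knobs can be passed to the offline LLM API; the specific values are illustrative starting points to sweep, not tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Throughput-oriented starting point for 2xH800.
# max_num_batched_tokens bounds the tokens scheduled per engine step:
# larger values batch more prefill work together (better throughput for
# long prompts), smaller values keep inter-token latency lower for
# requests that are already decoding.
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=2,
    enable_chunked_prefill=True,     # default in vLLM v1, shown for clarity
    max_num_batched_tokens=16384,    # try values above ~8K for long prompts
    max_num_seqs=256,                # cap on concurrently scheduled requests
    gpu_memory_utilization=0.9,      # leave headroom to avoid preemption
    max_model_len=32768,
)

outputs = llm.generate(
    ["<example long prompt>"],
    SamplingParams(max_tokens=28000),
)
```

A reasonable approach is to benchmark a representative mix of your short and long prompts while sweeping max_num_batched_tokens (e.g. 8K, 16K, 32K) and comparing end-to-end throughput.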

If you are still seeing significant slowdowns, monitor GPU utilization and KV cache usage to ensure you are not running into preemption or memory issues. You may also experiment with pipeline parallelism or expert parallelism if your hardware and model support it, as described in "Parallelism Strategies". Would you like a step-by-step guide on tuning these parameters for your specific setup?
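As a rough way to check the GPU side, a small NVML polling loop (using the nvidia-ml-py / pynvml package, independent of vLLM and assumed to be installed) can run alongside the job; KV cache usage and preemption warnings are easiest to read from vLLM's own log output:

```python
import time
import pynvml  # provided by the nvidia-ml-py package

# Poll utilization and memory on both H800s while the vLLM job runs.
# Sustained low SM utilization, or memory pinned at the cap, can point
# to scheduling stalls, preemption, or KV cache pressure.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(2)]

try:
    while True:
        for i, handle in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(
                f"GPU{i}: sm={util.gpu}% "
                f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB"
            )
        time.sleep(5)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```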
