Help needed with vLLM Qwen3-32B inference performance tuning (long input slow)

Hi everyone,

I’m currently running Qwen3-32B using vLLM 0.8.4 (v1) for long-context inference, and I’m experiencing significant performance degradation when handling long inputs (~2K+ tokens), even though the total token count (input + output) is roughly similar across samples.

Here are my current setup details:

  • Model: Qwen3-32B
  • Framework: vLLM 0.8.4 (v1)
  • Tensor Parallelism: tp=2
  • Max output length: ~28K tokens (average 10K+)
  • Input length:
    • Short: ~100 tokens
    • Long: ~2K+ tokens
  • Hardware: 2xH800 80G
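
For reference, my launch is roughly equivalent to the sketch below (simplified from my actual script, so the model path, max_model_len, and sampling settings are approximations):

```python
from vllm import LLM, SamplingParams

# Rough reproduction of the setup above: Qwen3-32B on 2xH800 with tp=2.
# Values here are approximate placeholders, not the exact production config.
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=2,       # tp=2 across the two H800s
    max_model_len=32768,          # must cover ~2K input + up to ~28K output
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,              # illustrative; actual sampling differs
    max_tokens=28000,             # long outputs, ~10K tokens on average
)

prompts = ["<short ~100-token prompt>", "<long ~2K-token prompt>"]
outputs = llm.generate(prompts, sampling_params)
```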

Observations

When processing short input prompts, the speed is quite decent:

Processed prompts:  55%|█████▍    | 1161/2112 [3:57:50<3:44:08, 14.14s/it, est. speed input: 6.40 toks/s, output: 831.10 toks/s]
Processed prompts:  55%|█████▌    | 1162/2112 [3:58:16<4:43:07, 17.88s/it, est. speed input: 6.39 toks/s, output: 829.93 toks/s]

However, when dealing with longer input prompts (~2K+ tokens), the throughput drops significantly:

Processed prompts:  14%|█▍        | 420/2952 [4:05:20<16:20:13, 23.23s/it, est. speed input: 59.82 toks/s, output: 509.59 toks/s]
Processed prompts:  14%|█▍        | 421/2952 [4:06:30<26:13:55, 37.31s/it, est. speed input: 59.68 toks/s, output: 508.23 toks/s]

The total number of tokens processed per request (input + output) is roughly balanced between short and long input cases, yet the long-input scenario is much slower.

Questions

I’m wondering if anyone has experienced similar issues or could offer suggestions on how to optimize vLLM’s performance for both long input and output sequences.

Thanks in advance!

Performance degradation with long input prompts in vLLM is a known issue. The prefill (input) phase is compute-bound, and its cost grows with input length: even when the total token count (input + output) is similar, longer inputs require more compute for attention over the prompt and for writing the KV cache, which lowers overall throughput.

Chunked prefill is enabled by default in vLLM v1 to mitigate this, but tuning max_num_batched_tokens can help further: higher values improve throughput for long prompts, while lower values reduce latency for shorter ones. For large models and long contexts, setting max_num_batched_tokens above 8096 is recommended for throughput, especially on large GPUs such as the H800. Also make sure you are not hitting memory or KV cache bottlenecks, and consider increasing tensor parallelism if possible, keeping the added synchronization overhead in mind. See "Performance Tuning with Chunked Prefill" and "Optimization and Tuning" in the vLLM docs for more details.
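To make that concrete, here is a minimal sketch of how those knobs can be passed to the offline LLM API; the specific values are illustrative starting points to sweep, not tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Throughput-oriented starting point for 2xH800.
# max_num_batched_tokens bounds the tokens scheduled per engine step:
# larger values batch more prefill work together (better throughput for
# long prompts), smaller values keep inter-token latency lower for
# requests that are already decoding.
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=2,
    enable_chunked_prefill=True,     # default in vLLM v1, shown for clarity
    max_num_batched_tokens=16384,    # try values above ~8K for long prompts
    max_num_seqs=256,                # cap on concurrently scheduled requests
    gpu_memory_utilization=0.9,      # leave headroom to avoid preemption
    max_model_len=32768,
)

outputs = llm.generate(
    ["<example long prompt>"],
    SamplingParams(max_tokens=28000),
)
```

A reasonable approach is to benchmark a representative mix of your short and long prompts while sweeping max_num_batched_tokens (e.g. 8K, 16K, 32K) and comparing end-to-end throughput.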

If you are still seeing significant slowdowns, monitor GPU utilization and KV cache usage to ensure you are not running into preemption or memory issues. You may also experiment with pipeline parallelism or expert parallelism if your hardware and model support it, as described in "Parallelism Strategies". Would you like a step-by-step guide on tuning these parameters for your specific setup?
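As a rough way to check the GPU side, a small NVML polling loop (using the nvidia-ml-py / pynvml package, independent of vLLM and assumed to be installed) can run alongside the job; KV cache usage and preemption warnings are easiest to read from vLLM's own log output:

```python
import time
import pynvml  # provided by the nvidia-ml-py package

# Poll utilization and memory on both H800s while the vLLM job runs.
# Sustained low SM utilization, or memory pinned at the cap, can point
# to scheduling stalls, preemption, or KV cache pressure.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(2)]

try:
    while True:
        for i, handle in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(
                f"GPU{i}: sm={util.gpu}% "
                f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB"
            )
        time.sleep(5)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```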
