Hi everyone,
I’m currently running Qwen3-32B with vLLM 0.8.4 (V1 engine) for long-context inference, and I’m seeing a significant drop in throughput on long inputs (~2K+ tokens), even though the total token count (input + output) per request is roughly similar across samples.
Here are my current setup details:
- Model: Qwen3-32B
- Framework: vLLM 0.8.4 (v1)
- Tensor Parallelism: tp=2
- Max output length: ~28K tokens (average 10K+)
- Input length:
  - Short: ~100 tokens
  - Long: ~2K+ tokens
- Hardware: 2× H800 (80 GB)
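
For reference, this is roughly how I launch the model via the offline LLM API. Everything other than tp=2 and the ~28K output cap is a placeholder I'm filling in for illustration, not my exact script:

```python
from vllm import LLM, SamplingParams

# Rough sketch of the setup above; values marked "assumed" are placeholders.
llm = LLM(
    model="Qwen/Qwen3-32B",      # actually loaded from a local checkpoint path
    tensor_parallel_size=2,      # tp=2 across the 2x H800 80GB
    max_model_len=32768,         # assumed: enough room for ~2K input + ~28K output
)

sampling_params = SamplingParams(
    temperature=0.6,             # assumed sampling settings
    top_p=0.95,
    max_tokens=28672,            # ~28K max output tokens
)

prompts = ["..."]                # ~100-token (short) or ~2K+-token (long) prompts
outputs = llm.generate(prompts, sampling_params)
```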
Observations
When processing short input prompts, the speed is quite decent:
Processed prompts: 55%|█████▍ | 1161/2112 [3:57:50<3:44:08, 14.14s/it, est. speed input: 6.40 toks/s, output: 831.10 toks/s]
Processed prompts: 55%|█████▌ | 1162/2112 [3:58:16<4:43:07, 17.88s/it, est. speed input: 6.39 toks/s, output: 829.93 toks/s]
However, when dealing with longer input prompts (~2K+ tokens), the throughput drops significantly:
Processed prompts: 14%|█▍ | 420/2952 [4:05:20<16:20:13, 23.23s/it, est. speed input: 59.82 toks/s, output: 509.59 toks/s]
Processed prompts: 14%|█▍ | 421/2952 [4:06:30<26:13:55, 37.31s/it, est. speed input: 59.68 toks/s, output: 508.23 toks/s]
The total number of tokens processed per request (input + output) is roughly comparable between the short- and long-input cases, yet the long-input runs are much slower; a quick comparison of the logged throughput estimates is shown below.
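
To make the gap concrete, here is the combined (input + output) token rate taken directly from the two progress bars above:

```python
# Combined throughput from the logged "est. speed" numbers above.
short = {"input_tps": 6.40, "output_tps": 831.10}    # short-input run
long_ = {"input_tps": 59.82, "output_tps": 509.59}   # long-input run

short_total = short["input_tps"] + short["output_tps"]   # ~837 toks/s
long_total = long_["input_tps"] + long_["output_tps"]    # ~569 toks/s

drop = 100 * (1 - long_total / short_total)
print(f"short: ~{short_total:.0f} toks/s, long: ~{long_total:.0f} toks/s, "
      f"drop: ~{drop:.0f}%")   # roughly a 30%+ drop in combined throughput
```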
Questions
I’m wondering if anyone has run into similar behavior, or could offer suggestions on how to optimize vLLM’s performance when both the input and the output sequences are long. For context, the engine arguments I assume are most relevant are sketched below.
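
These are the knobs I assume matter most for a long-input + long-output workload (names are from vLLM's engine arguments; the values below are placeholders I have not tuned or validated):

```python
from vllm import LLM

# Candidate knobs for long sequences (placeholder values, not tuned).
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=2,
    max_num_seqs=128,              # cap on concurrently scheduled sequences
    max_num_batched_tokens=8192,   # per-step token budget for the scheduler
    gpu_memory_utilization=0.95,   # leave more headroom for KV cache
    enable_prefix_caching=True,    # only helps if prompts share long prefixes
)
```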
Thanks in advance!