I’m observing unexpected scheduling behavior when benchmarking short-output workloads.
Setup
- Model: QuantTrio/MiniMax-M2.5-AWQ
- Tensor Parallelism: 8
- Expert Parallelism: enabled
- Context length: 196608
- Hardware: 8 × H200 GPUs
- vLLM serving via OpenAI-compatible API
Launch command
vllm serve /data/models/QuantTrio/MiniMax-M2.5-AWQ \
--served-model-name MiniMax-M2.5 \
--max-model-len 196608 \
--max-num-seqs 256 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Benchmark scenarios
Case 1
vllm bench serve \
--backend openai-chat \
--endpoint /v1/chat/completions \
--base-url http://127.0.0.1:8000 \
--model /data/models/QuantTrio/MiniMax-M2.5-AWQ \
--tokenizer /data/models/QuantTrio/MiniMax-M2.5-AWQ \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1 \
--num-prompts 768 \
--max-concurrency 64 \
--trust-remote-code \
--temperature 0.7 \
--top-p 0.9 \
--served-model-name MiniMax-M2.5 \
--seed $(date +%s)
- Input length: 1024
- Output length: 1
- Concurrency: 64
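For reference, a quick back-of-the-envelope check of the nominal prompt-token volume in Case 1 against the per-step token budget from the launch command. This is only a sketch: the variable names are illustrative, and the real token counts may be somewhat higher because the chat endpoint applies a chat template to each prompt.

```shell
# Values copied from the launch and benchmark commands above.
concurrency=64               # --max-concurrency
input_len=1024               # --random-input-len
num_prompts=768              # --num-prompts
batched_token_budget=65536   # --max-num-batched-tokens

# Nominal prompt tokens in flight if all 64 concurrent requests are prefilling.
in_flight_prompt_tokens=$((concurrency * input_len))

# Nominal prompt tokens over the whole benchmark run.
total_prompt_tokens=$((num_prompts * input_len))

echo "in-flight prompt tokens: $in_flight_prompt_tokens"
echo "total prompt tokens:     $total_prompt_tokens"
echo "per-step token budget:   $batched_token_budget"
```

Nominally, the 64 concurrent prompts add up to exactly the configured per-step token budget. This does not by itself explain the behavior, but it may be relevant context for the scheduler.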
Observed:
- TTFT is unexpectedly high
- Running requests remain around 8
- Waiting queue stays around 30-40
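The running and waiting counts above can be sampled while the benchmark runs. The sketch below assumes the server exposes the gauges `vllm:num_requests_running` and `vllm:num_requests_waiting` on its Prometheus `/metrics` endpoint. The helper name `filter_queue_metrics` is made up for illustration.

```shell
# Keep only the running/waiting request gauges from Prometheus text on stdin.
# Assumption: the gauges are named vllm:num_requests_running and
# vllm:num_requests_waiting; comment lines (# HELP / # TYPE) are dropped
# because they do not start with the metric name.
filter_queue_metrics() {
  grep -E '^vllm:num_requests_(running|waiting)[{ ]'
}

# Usage while the benchmark is running (host and port from the launch command):
#   while sleep 1; do
#     curl -s http://127.0.0.1:8000/metrics | filter_queue_metrics
#   done
```

Polling once per second is an arbitrary choice. With 1-token outputs a request's lifetime is short, so a finer interval may be needed to see the queue move.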
Case 2
- Input length: 1024
- Output length: 100
- Concurrency: 64 (same as Case 1)
Observed:
- Waiting queue is significantly smaller
- TTFT is lower
- Overall throughput is higher and more stable
Question
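- Why does a very short decode length (1 output token) lead to higher TTFT than a 100-token decode at the same concurrency?
- Are there recommended parameters for optimizing this kind of short-output workload?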