Why is TTFT worse for decode=1 than decode=100?

I’m observing unexpected scheduling behavior when benchmarking short-output workloads.

Setup

  • Model: QuantTrio/MiniMax-M2.5-AWQ
  • Tensor Parallelism: 8
  • Expert Parallelism: enabled
  • Context length: 196608
  • Hardware: 8 * H200 GPUs
  • vLLM serving via OpenAI-compatible API

Launch command

vllm serve /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --served-model-name MiniMax-M2.5 \
  --max-model-len 196608 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Benchmark scenarios

Case 1

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --base-url http://127.0.0.1:8000 \
  --model /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --tokenizer /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1 \
  --num-prompts 768 \
  --max-concurrency 64 \
  --trust-remote-code \
  --temperature 0.7 \
  --top-p 0.9 \
  --served-model-name MiniMax-M2.5 \
  --seed $(date +%s)
  • Input length: 1024
  • Output length: 1
  • Concurrency: 64

Observed:

  • TTFT is unexpectedly high
  • Running requests remain around 8
  • Waiting queue stays around 30-40

Case 2

  • Input length: 1024
  • Output length: 100
  • Same concurrency: 64

Observed:

  • Waiting queue is significantly smaller
  • TTFT is lower
  • Overall throughput is higher and more stable

Questions

  1. Why does a very short decode length lead to higher TTFT?
  2. Are there recommended parameters for optimizing short-output workloads?

With an output length of 1, each request is almost entirely prefill: the server processes the 1024-token prompt, emits one token, and the request is done. Prefill is compute-bound, and vLLM’s scheduler favors batching and throughput, processing prompts in large batches capped by --max-num-batched-tokens. Because each benchmark client can submit its next prompt as soon as its single token comes back, the server sees an almost continuous stream of prefill work, so new requests likely sit in the waiting queue until earlier prefills finish. There is also very little decode work to overlap with prefill. That is consistent with the high TTFT, the small number of running requests, and the 30-40 waiting requests you observe.

With an output length of 100, each request spends most of its lifetime in decode, which adds only one token per request per step. New prompts arrive less often, more requests can be in the running state at once, and queueing ahead of each prefill shrinks, which improves both TTFT and throughput. This is commonly reported behavior in vLLM and similar LLM serving systems, especially with large models and high parallelism settings.
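As a rough back-of-envelope check of the explanation above, the sketch below compares the prefill work per output token in the two cases. It uses only the numbers from the question (input length 1024, concurrency 64, --max-num-batched-tokens 65536); the variable names are made up for illustration, and it assumes all 64 clients have a prompt pending at the same moment, which is a simplification and not something measured here.

```shell
#!/bin/sh
# Values taken from the benchmark cases and the launch command in the question.
input_len=1024
concurrency=64
max_batched=65536   # --max-num-batched-tokens: token budget per scheduler step

# Prompt tokens pending if all 64 clients have a request in flight at once
# (simplifying assumption).
in_flight=$((input_len * concurrency))

# Prompt tokens the server must prefill for each output token it returns.
prefill_per_out_case1=$((input_len / 1))     # output length 1
prefill_per_out_case2=$((input_len / 100))   # output length 100 (integer division)

echo "prompt tokens in flight: $in_flight (budget per step: $max_batched)"
echo "prefill tokens per output token, output len 1:   $prefill_per_out_case1"
echo "prefill tokens per output token, output len 100: $prefill_per_out_case2"
```

Under these assumptions the output-length-1 case needs roughly a hundred times more prefill work per generated token, which is why it behaves like a pure prefill benchmark.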

To optimize for short-output workloads, try reducing --max-num-batched-tokens (e.g., to somewhere in the 2048–8192 range) so that each prefill batch is smaller and the first token comes back sooner, and consider lowering --max-num-seqs to match the expected concurrency (64 in this benchmark). This reduces batching delay at the cost of some peak throughput, which is generally acceptable for interactive, low-output scenarios. Re-run the benchmark after each change, since the best values depend on the model and hardware. See the vLLM performance tuning guidance and parallelism strategies documentation for more details.
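As a starting point, the tuning suggestion above could look like the following launch command. It is the original command with only two values changed; 8192 and 64 are illustrative assumptions picked from the suggested range and the benchmark's concurrency, not measured optima, so treat them as a first guess to benchmark against.

```shell
# Same launch command as in the question, with two values changed.
# Both new values are assumptions to be validated by re-running the benchmark.
vllm serve /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --served-model-name MiniMax-M2.5 \
  --max-model-len 196608 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

Comparing TTFT and throughput for both benchmark cases before and after the change shows whether the lower batching delay is worth the throughput cost on this hardware.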

Would you like a step-by-step breakdown or example launch commands?
