Why is TTFT worse for decode=1 than decode=100?

zbximo · April 26, 2026, 1:27pm

I’m observing unexpected scheduling behavior when benchmarking short-output workloads.

Setup

Model: QuantTrio/MiniMax-M2.5-AWQ
Tensor Parallelism: 8
Expert Parallelism: enabled
Context length: 196608
Hardware: 8 * H200 GPUs
vLLM serving via OpenAI-compatible API

Launch command

vllm serve /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --served-model-name MiniMax-M2.5 \
  --max-model-len 196608 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Benchmark scenarios

Case 1

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --base-url http://127.0.0.1:8000 \
  --model /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --tokenizer /data/models/QuantTrio/MiniMax-M2.5-AWQ \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1 \
  --num-prompts 768 \
  --max-concurrency 64 \
  --trust-remote-code \
  --temperature 0.7 \
  --top-p 0.9 \
  --served-model-name MiniMax-M2.5 \
  --seed $(date +%s)

Input length: 1024
Output length: 1
Concurrency: 64

Observed:

TTFT is unexpectedly high
Running requests remain around 8
Waiting queue stays around 30-40

Case 2

Input length: 1024
Output length: 100
Same concurrency: 64

Observed:

Waiting queue is significantly smaller
TTFT is lower
Overall throughput is higher and more stable
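
For anyone reproducing the running and waiting counts quoted in both cases, one way to sample them is the server's Prometheus endpoint. This is a minimal sketch, assuming the server exposes /metrics on the same host and port as the API and publishes the gauges vllm:num_requests_running and vllm:num_requests_waiting; the helper name queue_gauges is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the running/waiting gauges from Prometheus text on stdin.
queue_gauges() {
  grep -E '^vllm:num_requests_(running|waiting)'
}

# Usage while a benchmark is in flight (assumes the server from the launch command above):
#   while sleep 1; do curl -s http://127.0.0.1:8000/metrics | queue_gauges; done
```

Sampling once per second during each case makes it easier to compare the two queue profiles side by side.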

Question

Why does a very short decode length lead to higher TTFT?
Are there recommended parameters for optimizing this workload?

RunLLM · April 26, 2026, 1:28pm

With an output length of 1, each request finishes as soon as its prefill completes, so the workload is almost entirely prefill (prompt processing). Prefill is compute-bound, and vLLM’s scheduler prioritizes batching and throughput, packing prompts into large batches. Requests therefore wait longer behind those batches, and with almost no decode work the system cannot overlap prompt and decode processing, which leads to higher TTFT and fewer running requests at a time. With longer outputs (e.g., 100 tokens), decode steps dominate: requests stay active across many steps, more of them are processed concurrently, and queueing delays shrink, which improves both TTFT and throughput. This is a known behavior in vLLM and similar LLM serving systems, especially with large models and high parallelism settings, as described in performance discussions and benchmarking docs.
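
As a rough back-of-the-envelope check, the numbers in the original post can be plugged into a token-budget calculation. The sketch below only divides a --max-num-batched-tokens budget by the 1024-token prompt length. It assumes every prompt is prefilled whole within one step and ignores decode tokens, so it gives an upper bound on prompts per step, not the scheduler's actual behavior.

```shell
#!/usr/bin/env bash
INPUT_LEN=1024   # --random-input-len from the benchmark command

# Upper bound on whole prompts that fit into one step for a given token budget.
prompts_per_step() {
  local budget=$1
  echo $(( budget / INPUT_LEN ))
}

for budget in 65536 8192 2048; do
  echo "budget=$budget -> up to $(prompts_per_step "$budget") prompts per step"
done
```

With the posted budget of 65536, a single step could in principle hold all 64 in-flight prompts, which is consistent with very large prefill batches; smaller budgets cap the batch at 8 or 2 prompts.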

To optimize for short-output workloads, reduce --max-num-batched-tokens (e.g., to 2048–8192) to shrink each prefill batch and improve TTFT, and consider lowering --max-num-seqs to match the expected concurrency (64 in this benchmark). This trades some throughput for shorter batching delays, which is generally acceptable for interactive, low-output scenarios. See performance tuning guidance and parallelism strategies for more details.
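
The suggestion above could be tried as a small sweep. The following is a dry-run sketch that only prints candidate launch commands and starts nothing. It reuses flags from the original launch command, sets --max-num-seqs to 64 on the assumption that it should match the benchmark's --max-concurrency, and leaves out the tool-call and reasoning parser flags for brevity. It also assumes chunked prefill is active, since otherwise a token budget below --max-model-len may be rejected. The candidate budgets are taken from the suggested range and are not measured optima.

```shell
#!/usr/bin/env bash
# Dry run: print one candidate launch command per token budget; nothing is started.
MODEL_PATH=/data/models/QuantTrio/MiniMax-M2.5-AWQ
MAX_NUM_SEQS=64   # assumption: match the benchmark's --max-concurrency

print_launch_commands() {
  local budget
  for budget in 2048 4096 8192; do   # suggested range, not measured optima
    echo "vllm serve $MODEL_PATH" \
      "--served-model-name MiniMax-M2.5" \
      "--max-model-len 196608" \
      "--max-num-seqs $MAX_NUM_SEQS" \
      "--max-num-batched-tokens $budget" \
      "--gpu-memory-utilization 0.9" \
      "--tensor-parallel-size 8" \
      "--enable-expert-parallel" \
      "--trust-remote-code" \
      "--host 0.0.0.0 --port 8000"
  done
}

print_launch_commands
```

Rerunning the Case 1 benchmark against each variant and comparing TTFT would show whether the smaller budget helps on this hardware.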

Would you like a step-by-step breakdown or example launch commands?
