Hello, there.
I deployed an LLM (Qwen3-235b-23b-instruct) on 8 GPUs, each with 80GB of VRAM.
But my prompt lengths vary from 1k to 16k tokens, and I want to reduce Time To First Token (TTFT).
So, my questions are:
(1) What is the --max-num-batched-tokens parameter used for?
(2) In my situation I didn't configure it; what value should I set?
The --max-num-batched-tokens parameter controls the maximum number of tokens (across all requests) that can be processed in a single engine step. It directly impacts Time To First Token (TTFT) and throughput: larger values let more prompt tokens be prefilled per step, which improves throughput and TTFT for long prompts, but can increase inter-token latency and delay short prompts queued behind a long prefill; smaller values keep short-prompt and decode latency low, but spread long prefills over more steps and therefore slow TTFT for long prompts. If not set, vLLM uses a default value, which may not be optimal for a prompt range as wide as yours. According to the vLLM optimization docs, for large models and variable prompt lengths, setting --max-num-batched-tokens to 8192 or higher is recommended for better throughput and TTFT, especially on large GPUs.
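If it helps, here is a minimal sketch of setting this via the offline Python API, assuming the server flag --max-num-batched-tokens maps to the max_num_batched_tokens engine argument; the model ID, the 8192 value, and the enable_chunked_prefill flag below are illustrative, not your exact setup:

```python
# Sketch only: adjust the model path/ID and token budget to your deployment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",  # placeholder model ID; use your own
    tensor_parallel_size=8,                      # 8 x 80GB GPUs
    max_num_batched_tokens=8192,                 # per-step token budget (prefill + decode)
    enable_chunked_prefill=True,                 # split long prefills across steps
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same value can be passed to `vllm serve` via `--max-num-batched-tokens`.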
In your case (1k–16k prompt tokens, 8×80GB GPUs, a 235B model), try --max-num-batched-tokens=16384 or higher and benchmark TTFT, then adjust up or down based on observed TTFT and GPU memory usage. For more detail on tuning, see the vLLM optimization docs; a rough way to measure TTFT yourself is sketched below. Would you like a step-by-step explanation or more example commands?
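As a starting point, here is a rough, non-authoritative sketch for measuring TTFT against an OpenAI-compatible vLLM server using streaming; the base URL, served model name, and prompt sizes are placeholders to adjust:

```python
# Rough TTFT probe: time from request send to the first streamed content chunk.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server address

def measure_ttft(prompt: str, model: str, max_tokens: int = 64) -> float:
    """Return seconds until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

# Compare a short and a long prompt (token counts are approximate).
for label, prompt in [("~1k tokens", "word " * 1000), ("~16k tokens", "word " * 16000)]:
    ttft = measure_ttft(prompt, model="your-served-model-name")  # placeholder name
    print(f"{label}: {ttft:.3f}s")
```

Running this before and after changing --max-num-batched-tokens gives a quick sense of whether the new value actually helps for your prompt mix.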