Hello, there.
I deployed an LLM (Qwen3-235b-23b-instruct) on 8 GPUs, each with 80GB of VRAM.
But my prompt lengths vary from 1k to 16k tokens, and I want to reduce Time To First Token (TTFT).
So, my questions are:
(1) What is the --max-num-batched-tokens parameter used for?
(2) In my situation I didn't configure it; what value should I set?
The --max-num-batched-tokens parameter controls the maximum number of tokens (across all requests) that can be processed in a single engine step. It directly impacts Time To First Token (TTFT) and throughput: larger values let more prompt tokens be prefilled per step, which improves throughput and TTFT for long prompts, but can increase inter-token latency and delay short prompts queued behind a long prefill; smaller values keep short-prompt and decode latency low, but spread long prefills over more steps and therefore slow TTFT for long prompts. If not set, vLLM uses a default value, which may not be optimal for a prompt range as wide as yours. According to the vLLM optimization docs, for large models and variable prompt lengths, setting --max-num-batched-tokens to 8192 or higher is recommended for better throughput and TTFT, especially on large GPUs.
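If it helps, here is a minimal sketch of setting this via the offline Python API, assuming the server flag --max-num-batched-tokens maps to the max_num_batched_tokens engine argument; the model ID, the 8192 value, and the enable_chunked_prefill flag below are illustrative, not your exact setup:

```python
# Sketch only: adjust the model path/ID and token budget to your deployment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",  # placeholder model ID; use your own
    tensor_parallel_size=8,                      # 8 x 80GB GPUs
    max_num_batched_tokens=8192,                 # per-step token budget (prefill + decode)
    enable_chunked_prefill=True,                 # split long prefills across steps
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same value can be passed to `vllm serve` via `--max-num-batched-tokens`.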
In your case (1k–16k prompt tokens, 8×80GB GPUs, a 235B model), try --max-num-batched-tokens=16384 or higher and benchmark TTFT, then adjust up or down based on observed TTFT and GPU memory usage. For more detail on tuning, see the vLLM optimization docs; a rough way to measure TTFT yourself is sketched below. Would you like a step-by-step explanation or more example commands?
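As a starting point, here is a rough, non-authoritative sketch for measuring TTFT against an OpenAI-compatible vLLM server using streaming; the base URL, served model name, and prompt sizes are placeholders to adjust:

```python
# Rough TTFT probe: time from request send to the first streamed content chunk.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server address

def measure_ttft(prompt: str, model: str, max_tokens: int = 64) -> float:
    """Return seconds until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

# Compare a short and a long prompt (token counts are approximate).
for label, prompt in [("~1k tokens", "word " * 1000), ("~16k tokens", "word " * 16000)]:
    ttft = measure_ttft(prompt, model="your-served-model-name")  # placeholder name
    print(f"{label}: {ttft:.3f}s")
```

Running this before and after changing --max-num-batched-tokens gives a quick sense of whether the new value actually helps for your prompt mix.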