engine_kwargs = {
# quantization: Enables FP8 quantization of model weights. FP8 can reduce memory use and increase speed,
# but simple per-tensor FP8 with dynamic scales may underperform or degrade quality when weights contain outliers.[9, 10]
# For best quality, consider calibrated (static) scales or PTPC-FP8 (on AMD ROCm).[11, 12]
"quantization": "fp8",
# kv_cache_dtype: Storing Key-Value (KV) cache data in FP8 significantly enhances memory efficiency,
# effectively doubling the maximum token capacity within the same memory footprint.[6, 7, 8]
# This directly boosts the number of concurrent requests. Supported on CUDA 11.8+ and ROCm.[7]
"kv_cache_dtype": "fp8",
"gpu_memory_utilization": 0.80,
# max_num_batched_tokens: Maximum total tokens across all sequences in a single batch.
# Crucial for prefill efficiency. vLLM recommends setting this above 2048 for optimal throughput.[15]
"max_num_batched_tokens": 4096,
# tokenizer_mode: Controls the tokenizer implementation. "auto" attempts to use a fast, Rust-based tokenizer,
# which significantly reduces CPU overhead and helps keep the GPU from sitting idle waiting on tokenization.[16, 17]
"tokenizer_mode": "auto",
# disable_log_stats: Disables detailed performance statistics logging.
# This reduces CPU overhead in high-throughput production environments, freeing resources for inference.[18, 19]
"disable_log_stats": True,
# max_seq_len_to_capture: Maximum sequence length for which CUDA graphs are employed.
# Tune to encompass the majority of expected sequence lengths to maximize CUDA graph utilization.[22]
"max_seq_len_to_capture": 2048, # Should ideally match max_model_len for consistent CUDA graph usage
# "max_model_len": 16384, # Should match max_seq_len_to_capture
"max_num_seqs": 32,
"tensor_parallel_size": 1,
"enable_chunked_prefill": False, # Explicitly set to False to prioritize prefill operations, optimizing TTFT
# "dtype": torch.bfloat16, # Explicitly set dtype for clarity and consistency
# "use_cuda_graphs": True, # Enable CUDA graphs for performance optimization
# "use_flash_attention": True, # Enable flash attention for faster inference
"trust_remote_code": True,
}
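As a minimal usage sketch (assuming the vllm package is installed; the model id below is a placeholder, not part of the original configuration), the dict unpacks directly into vllm.LLM for offline inference:

from vllm import LLM, SamplingParams

# Placeholder model id for illustration only; substitute the actual checkpoint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **engine_kwargs)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize the benefits of an FP8 KV cache in one sentence."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)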
This is also a configuration block, but it has the same issue.