vLLM seems to stall other sessions during long-prompt prefill

Hello,

I am running vLLM with the TP=2 option using two Pro6000 GPUs. The command used for the currently running Docker container is as follows:

docker run --name vllm-gpt-oss \
           --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --restart unless-stopped \
           -p 8000:8000 \
           --shm-size=8g \
           --entrypoint /bin/bash \
           vllm/vllm-openai:latest \
           -c "pip install --force-reinstall --no-deps nvidia-nccl-cu12==2.27.7 && \
               python3 -m vllm.entrypoints.openai.api_server \
               --model openai/gpt-oss-120b \
               --tensor-parallel-size 2"

Everything works well, but when I send a query with a large attached document via OpenWebUI (or a similar client) that approaches the 128k context-length limit, the prefill step causes all other sessions to pause for 10–20 seconds, or their token generation becomes extremely slow. As a result, other users have a very poor experience.

Do you know any ways to mitigate or resolve this issue?

Thank you.

This slowdown is expected: when a very long prompt (near the context-length limit) is sent, the prefill (prompt-processing) step is compute-bound and can block or slow down other requests, especially with tensor parallelism.

vLLM V1 always enables chunked prefill, which helps by splitting large prefills into smaller chunks and interleaving them with decode requests. However, if max_num_batched_tokens is set too high, a long prefill can still dominate each batch and increase latency for other users.

To mitigate this, reduce max_num_batched_tokens (e.g., to 2048–8192) to prioritize decode requests and improve inter-token latency (ITL) for all sessions. This makes the system more responsive during large prefills, at the cost of slightly lower throughput for long prompts. See Optimization and Tuning and Performance Tuning with Chunked Prefill in the vLLM docs for details.

Example (offline LLM API; the same setting exists for the server):

from vllm import LLM

# Cap the tokens scheduled per step so long prefills are chunked and interleaved with decodes.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2, max_num_batched_tokens=4096)
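For the Docker/API-server setup in your question, the same limit can be passed as the --max-num-batched-tokens server flag; the value below is only an illustrative starting point and should be tuned for your workload:

python3 -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --tensor-parallel-size 2 \
    --max-num-batched-tokens 4096

Lower values improve responsiveness for concurrent decode sessions during a large prefill; higher values favor throughput on long prompts.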

Happy to go into more detail on tuning for your specific workload.
