vLLM seems to stall other sessions during long-prompt prefill

Hello,

I am running vLLM with the TP=2 option using two Pro6000 GPUs. The command used for the currently running Docker container is as follows:

docker run --name vllm-gpt-oss \
           --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --restart unless-stopped \
           -p 8000:8000 \
           --shm-size=8g \
           --entrypoint /bin/bash \
           vllm/vllm-openai:latest \
           -c "pip install --force-reinstall --no-deps nvidia-nccl-cu12==2.27.7 && \
               python3 -m vllm.entrypoints.openai.api_server \
               --model openai/gpt-oss-120b \
               --tensor-parallel-size 2"

Everything works well, but when I send a query with a large attached document via OpenWebUI (or a similar client) that approaches the 128k context-length limit, the prefill step causes all other sessions to pause for 10–20 seconds, or their token generation becomes extremely slow. As a result, other users have a very poor experience.

Do you know any ways to mitigate or resolve this issue?

Thank you.

This slowdown is expected: when a very long prompt (near the context-length limit) is sent, the prefill (prompt-processing) step is compute-bound and can block or slow down other requests, especially with tensor parallelism.

vLLM V1 always enables chunked prefill, which helps by splitting large prefills into smaller chunks and interleaving them with decode requests. However, if max_num_batched_tokens is set too high, a long prefill can still dominate each batch and increase latency for other users.

To mitigate this, reduce max_num_batched_tokens (e.g., to 2048–8192) to prioritize decode requests and improve inter-token latency (ITL) for all sessions. This makes the system more responsive during large prefills, at the cost of slightly lower throughput for long prompts. See Optimization and Tuning and Performance Tuning with Chunked Prefill in the vLLM docs for details.

Example (offline LLM API; the same setting exists for the server):

from vllm import LLM

# Cap the tokens scheduled per step so long prefills are chunked and interleaved with decodes.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2, max_num_batched_tokens=4096)
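For the Docker/API-server setup in your question, the same limit can be passed as the --max-num-batched-tokens server flag; the value below is only an illustrative starting point and should be tuned for your workload:

python3 -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --tensor-parallel-size 2 \
    --max-num-batched-tokens 4096

Lower values improve responsiveness for concurrent decode sessions during a large prefill; higher values favor throughput on long prompts.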

Happy to go into more detail on tuning for your specific workload.
