Hello,
I am running vLLM with tensor parallelism (TP=2) across two Pro 6000 GPUs. The command used for the currently running Docker container is as follows:
docker run --name vllm-gpt-oss \
--runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
-p 8000:8000 \
--shm-size=8g \
--entrypoint /bin/bash \
vllm/vllm-openai:latest \
-c "pip install --force-reinstall --no-deps nvidia-nccl-cu12==2.27.7 && \
python3 -m vllm.entrypoints.openai.api_server \
--model openai/gpt-oss-120b \
--tensor-parallel-size 2"
Everything works well, but when I send a query via OpenWebUI (or a similar client) with a large attached document that approaches the 128k context-length limit, the prefill step causes all other sessions to stall for 10–20 seconds, or their token generation becomes extremely slow. As a result, other users have a very unpleasant experience.
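To make the question concrete: as far as I understand, vLLM supports chunked prefill, which caps how many prompt tokens are processed per scheduler step so that decoding for other requests can be interleaved instead of waiting for the entire long prefill to finish. Would something along these lines be the right direction? This is only a sketch; the token budget of 2048 below is just a guess on my part, not a tested value.

python3 -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --tensor-parallel-size 2 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048  # cap on tokens batched per scheduler step; 2048 is an untested guess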
Do you know of any ways to mitigate or resolve this issue?
Thank you.