@RunLLM, I know that vLLM queues requests and schedules them based on concurrency and the available KV cache. Could you explain this? My question is about requests that get cancelled: say a request times out or is cancelled on the client side. How will vLLM cancel that specific request from the queue?
When a client disconnects or cancels a request (e.g., due to timeout), vLLM detects the HTTP disconnect and aborts the corresponding request. For queued requests, vLLM cancels them before execution; for running requests, it interrupts processing, but the effectiveness depends on the model and pipeline—some requests may only be aborted after the current execution step completes, especially for long-running tasks. GPU resources are released promptly for cancelled requests, but full cleanup may depend on pipeline implementation and granularity of abort checks. This mechanism is already implemented for LLMs and is being improved for diffusion models to check for aborts between execution steps for faster cancellation response times. See details in vllm-omni issue #1347 and vllm issue #20798.
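The step-boundary behavior described above can be sketched as a toy engine loop. This is an illustration only, not vLLM's actual code: the names `AbortRegistry`, `run_engine`, and the step counting are all hypothetical, and the point is just that aborts flagged mid-step only take effect at the next step boundary.

```python
import threading

class AbortRegistry:
    """Hypothetical registry of request IDs flagged for abort.
    Illustrates the idea of honoring aborts at step boundaries."""
    def __init__(self):
        self._lock = threading.Lock()
        self._aborted: set[str] = set()

    def abort(self, request_id: str) -> None:
        with self._lock:
            self._aborted.add(request_id)

    def drain(self) -> set[str]:
        """Return and clear the set of aborted IDs."""
        with self._lock:
            aborted, self._aborted = self._aborted, set()
            return aborted

def run_engine(registry: AbortRegistry, active: dict, max_steps: int = 100):
    """Toy engine loop: one 'execution step' per iteration.
    A cancellation that lands mid-step is honored one step later,
    which is why abort latency depends on abort-check granularity."""
    finished = []
    for _ in range(max_steps):
        # Check for aborts *between* steps: cancelled requests are
        # dropped before the next pass, freeing their slots.
        for rid in registry.drain():
            active.pop(rid, None)
        if not active:
            break
        for rid in list(active):
            active[rid] -= 1          # one decode step per request
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]
    return finished
```

Running with `active = {"a": 3, "b": 5}` and `registry.abort("b")` before the loop, only `"a"` finishes; `"b"` is dropped at the first step boundary without executing.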
If you want to actively cancel a request from the client side, simply closing the HTTP connection will trigger vLLM to abort the request. There is no official API endpoint to abort a specific request by ID from the server side; cancellation is handled via client disconnects. For more, see vllm issue #6975.
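The client-side flow can be mimicked without a live server using plain asyncio. In this sketch (all names hypothetical, not vLLM's API), cancelling the streaming task plays the role of closing the HTTP connection, and the handler's cleanup on cancellation mirrors vLLM aborting the request on disconnect.

```python
import asyncio

async def stream_tokens(request_id: str, n: int, aborted: list[str]):
    """Stand-in for a server handler streaming a response. The
    CancelledError branch mirrors the server aborting the request
    when the client disconnects."""
    try:
        for _ in range(n):
            await asyncio.sleep(0.01)   # pretend to generate a token
    except asyncio.CancelledError:
        aborted.append(request_id)      # disconnect detected -> abort
        raise

async def client_with_timeout():
    aborted: list[str] = []
    try:
        # Client-side timeout: the task cancellation stands in for
        # closing the HTTP connection.
        await asyncio.wait_for(
            stream_tokens("req-1", 1000, aborted), timeout=0.05
        )
    except asyncio.TimeoutError:
        pass
    return aborted
```

Calling `asyncio.run(client_with_timeout())` triggers the timeout well before the 1000 "tokens" finish, so the handler records the abort for `"req-1"`.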
Would you like a step-by-step breakdown of the cancellation flow or more technical details?
Hint: Mention @RunLLM in the post for followups.
@RunLLM How does vLLM abort a request that is already in the queue? Is it using a hashmap? Could you explain more on that?
vLLM tracks queued requests using a dictionary (hashmap) keyed by request ID, allowing efficient lookup and removal. When a request is aborted (e.g., due to client disconnect), vLLM removes the request from the queue and its internal tracking structures, ensuring it is not scheduled or executed. This is handled in the scheduler logic, where methods like remove_request and remove_requests are used to delete specific requests from the queue and the internal requests dictionary (API docs).
This approach ensures that aborting a queued request is fast and does not require scanning the entire queue. For more, see request_queue API and scheduler.py source.
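A minimal sketch of the idea, using a toy class (`ToyScheduler` and its method names are hypothetical, not vLLM's actual scheduler): an ordered dict keyed by request ID preserves FIFO order for scheduling while giving O(1) removal on abort, so no queue scan is needed.

```python
from collections import OrderedDict

class ToyScheduler:
    """Illustrative dict-keyed request tracking (not vLLM's code)."""
    def __init__(self):
        self.waiting: "OrderedDict[str, dict]" = OrderedDict()

    def add_request(self, request_id: str, params: dict) -> None:
        self.waiting[request_id] = params

    def abort_request(self, request_id: str) -> bool:
        # O(1) lookup + delete by key: no scan of the whole queue.
        return self.waiting.pop(request_id, None) is not None

    def schedule_next(self):
        # Pop the oldest waiting request (FIFO order).
        if not self.waiting:
            return None
        request_id, _params = self.waiting.popitem(last=False)
        return request_id
```

After queuing `r1`, `r2`, `r3` and aborting `r2`, scheduling proceeds `r1` then `r3`; aborting an unknown ID is a no-op returning `False`.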
Would you like a step-by-step explanation of how the abort/removal process works internally?