@RunLLM, I know that vLLM queues requests and schedules them based on concurrency and the available KV cache. Could you explain this? My question is about requests that get cancelled: say a request times out or is cancelled on the client side. How will vLLM cancel that specific request from the queue?
When a client disconnects or cancels a request (e.g., due to timeout), vLLM detects the HTTP disconnect and aborts the corresponding request. For queued requests, vLLM cancels them before execution; for running requests, it interrupts processing, but the effectiveness depends on the model and pipeline—some requests may only be aborted after the current execution step completes, especially for long-running tasks. GPU resources are released promptly for cancelled requests, but full cleanup may depend on pipeline implementation and granularity of abort checks. This mechanism is already implemented for LLMs and is being improved for diffusion models to check for aborts between execution steps for faster cancellation response times. See details in vllm-omni issue #1347 and vllm issue #20798.
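The step-boundary behavior described above can be sketched as a toy engine loop. This is an illustration only, not vLLM's actual code: the names `AbortRegistry`, `run_engine`, and the step counting are all hypothetical, and the point is just that aborts flagged mid-step only take effect at the next step boundary.

```python
import threading

class AbortRegistry:
    """Hypothetical registry of request IDs flagged for abort.
    Illustrates the idea of honoring aborts at step boundaries."""
    def __init__(self):
        self._lock = threading.Lock()
        self._aborted: set[str] = set()

    def abort(self, request_id: str) -> None:
        with self._lock:
            self._aborted.add(request_id)

    def drain(self) -> set[str]:
        """Return and clear the set of aborted IDs."""
        with self._lock:
            aborted, self._aborted = self._aborted, set()
            return aborted

def run_engine(registry: AbortRegistry, active: dict, max_steps: int = 100):
    """Toy engine loop: one 'execution step' per iteration.
    A cancellation that lands mid-step is honored one step later,
    which is why abort latency depends on abort-check granularity."""
    finished = []
    for _ in range(max_steps):
        # Check for aborts *between* steps: cancelled requests are
        # dropped before the next pass, freeing their slots.
        for rid in registry.drain():
            active.pop(rid, None)
        if not active:
            break
        for rid in list(active):
            active[rid] -= 1          # one decode step per request
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]
    return finished
```

Running with `active = {"a": 3, "b": 5}` and `registry.abort("b")` before the loop, only `"a"` finishes; `"b"` is dropped at the first step boundary without executing.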
If you want to actively cancel a request from the client side, simply closing the HTTP connection will trigger vLLM to abort the request. There is no official API endpoint to abort a specific request by ID from the server side; cancellation is handled via client disconnects. For more, see vllm issue #6975.
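The client-side flow can be mimicked without a live server using plain asyncio. In this sketch (all names hypothetical, not vLLM's API), cancelling the streaming task plays the role of closing the HTTP connection, and the handler's cleanup on cancellation mirrors vLLM aborting the request on disconnect.

```python
import asyncio

async def stream_tokens(request_id: str, n: int, aborted: list[str]):
    """Stand-in for a server handler streaming a response. The
    CancelledError branch mirrors the server aborting the request
    when the client disconnects."""
    try:
        for _ in range(n):
            await asyncio.sleep(0.01)   # pretend to generate a token
    except asyncio.CancelledError:
        aborted.append(request_id)      # disconnect detected -> abort
        raise

async def client_with_timeout():
    aborted: list[str] = []
    try:
        # Client-side timeout: the task cancellation stands in for
        # closing the HTTP connection.
        await asyncio.wait_for(
            stream_tokens("req-1", 1000, aborted), timeout=0.05
        )
    except asyncio.TimeoutError:
        pass
    return aborted
```

Calling `asyncio.run(client_with_timeout())` triggers the timeout well before the 1000 "tokens" finish, so the handler records the abort for `"req-1"`.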
Would you like a step-by-step breakdown of the cancellation flow or more technical details?
Hint: Mention @RunLLM in the post for followups.
@RunLLM How does vLLM abort a request that is already in the queue? Is it using a hashmap? Could you explain more on that?
vLLM tracks queued requests using a dictionary (hashmap) keyed by request ID, allowing efficient lookup and removal. When a request is aborted (e.g., due to client disconnect), vLLM removes the request from the queue and its internal tracking structures, ensuring it is not scheduled or executed. This is handled in the scheduler logic, where methods like remove_request and remove_requests are used to delete specific requests from the queue and the internal requests dictionary (API docs).
This approach ensures that aborting a queued request is fast and does not require scanning the entire queue. For more, see request_queue API and scheduler.py source.
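A minimal sketch of the idea, using a toy class (`ToyScheduler` and its method names are hypothetical, not vLLM's actual scheduler): an ordered dict keyed by request ID preserves FIFO order for scheduling while giving O(1) removal on abort, so no queue scan is needed.

```python
from collections import OrderedDict

class ToyScheduler:
    """Illustrative dict-keyed request tracking (not vLLM's code)."""
    def __init__(self):
        self.waiting: "OrderedDict[str, dict]" = OrderedDict()

    def add_request(self, request_id: str, params: dict) -> None:
        self.waiting[request_id] = params

    def abort_request(self, request_id: str) -> bool:
        # O(1) lookup + delete by key: no scan of the whole queue.
        return self.waiting.pop(request_id, None) is not None

    def schedule_next(self):
        # Pop the oldest waiting request (FIFO order).
        if not self.waiting:
            return None
        request_id, _params = self.waiting.popitem(last=False)
        return request_id
```

After queuing `r1`, `r2`, `r3` and aborting `r2`, scheduling proceeds `r1` then `r3`; aborting an unknown ID is a no-op returning `False`.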
Would you like a step-by-step explanation of how the abort/removal process works internally?