I want to implement and experiment with different scheduling algorithms on top of the vLLM framework. More specifically, my goal is to tweak the way memory (KV cache) blocks are assigned to each incoming prompt, and to be able to write custom logic that assigns memory blocks in a controlled, algorithm-defined manner.
I have started reading the vLLM codebase, but it is fairly large, and I am not sure where the right entry points are. At the moment, I only have a rough guess that the scheduler and block manager related classes are the main components I should focus on.
My questions are:
Which parts of the codebase should I study first to understand how memory blocks are allocated to requests/prompts?
Is the scheduler the correct place to implement custom logic for memory-block assignment, or is this primarily handled elsewhere (e.g., block managers / KV cache managers)?
Are there recommended extension points or abstractions for experimenting with new scheduling or memory allocation policies without heavily modifying the core engine?
Any pointers to specific files, classes, or prior discussions would be greatly appreciated.
Thanks in advance — any guidance on how to approach this systematically would be very helpful.
To experiment with custom scheduling and memory (KV cache) allocation in vLLM, you should focus on the Scheduler and KV cache manager components. The scheduler decides which requests run in each engine step, while the actual memory block allocation is handled by the KV cache manager and its underlying block manager classes. The main extension points for memory allocation logic are therefore in the KV cache manager, not the scheduler itself. For scheduling policies, vLLM provides a pluggable scheduler interface, and for memory allocation, the block manager (e.g., SelfAttnBlockSpaceManager) is the key abstraction to study and potentially extend. See the block manager API docs and the hybrid KV cache manager design for details.
Recommended files to start with are vllm/core/block_manager.py (block allocation logic), vllm/v1/core/kv_cache_manager.py (KV cache management), and vllm/v1/core/scheduler.py (scheduling policies). For experimenting, vLLM supports pluggable scheduling policies and modular block manager implementations, so you can subclass or swap these components without heavy core modifications. See also the priority scheduling RFC and adaptive scheduler RFC for examples of custom scheduling logic.
Would you like a step-by-step breakdown of how a request flows through these components or example code for a custom scheduler?
Certainly! Here’s a step-by-step breakdown of how a request flows through vLLM’s scheduling and memory allocation components, with pointers to the relevant files and extension points:
Request Scheduling
The scheduler selects which requests to run based on its policy (see SchedulingPolicy and the pluggable scheduler interface). You can implement custom scheduling by subclassing or swapping the scheduler class (example PR for pluggable scheduler).
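To make the "policy" idea concrete, here is a minimal, self-contained sketch of policy-driven request selection. Note the caveat: `Request`, `FCFSPolicy`, and `PriorityPolicy` are toy names for illustration, not vLLM classes; the real scheduler also tracks running/preempted state and token budgets.

```python
from dataclasses import dataclass

# Toy sketch (not vLLM's API): a pluggable policy is just an object the
# scheduler asks "which waiting request should be admitted next?"

@dataclass
class Request:
    priority: int   # lower number = more urgent
    arrival: int    # monotonically increasing arrival order
    prompt: str

class FCFSPolicy:
    """Admit the longest-waiting request first."""
    def pick(self, waiting):
        return min(waiting, key=lambda r: r.arrival)

class PriorityPolicy:
    """Admit the highest-priority request, breaking ties by arrival."""
    def pick(self, waiting):
        return min(waiting, key=lambda r: (r.priority, r.arrival))

waiting = [
    Request(priority=2, arrival=0, prompt="a"),
    Request(priority=0, arrival=1, prompt="b"),
]
print(FCFSPolicy().pick(waiting).prompt)      # a (earliest arrival)
print(PriorityPolicy().pick(waiting).prompt)  # b (highest priority)
```

Swapping the policy object changes admission order without touching the rest of the loop, which is the same separation the pluggable scheduler interface aims for.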
KV Cache Block Allocation
When a request is scheduled, the scheduler calls into the KV cache manager (KVCacheManager in vllm/v1/core/kv_cache_manager.py), which manages logical block tables for each request.
The KV cache manager delegates actual block allocation to the block manager (e.g., SelfAttnBlockSpaceManager in vllm/core/block_manager.py), which handles the assignment and tracking of physical memory blocks.
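The division of labor above can be sketched in a few lines. This is a simplified toy, not vLLM's real classes (`ToyBlockPool` and `ToyKVCacheManager` are invented names): the manager keeps a per-request block table and hands out fixed-size block IDs from a free pool, refusing when memory runs out so the scheduler can wait or preempt.

```python
# Toy sketch of the allocation flow, not actual vLLM code.
BLOCK_SIZE = 16  # tokens per KV cache block (configurable in vLLM)

class ToyBlockPool:
    """Tracks free physical block IDs."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self, n):
        if len(self.free) < n:
            return None  # out of memory: caller must wait or preempt
        return [self.free.pop() for _ in range(n)]

    def release(self, blocks):
        self.free.extend(blocks)

class ToyKVCacheManager:
    """Maps each request to the physical blocks backing its KV cache."""
    def __init__(self, pool):
        self.pool = pool
        self.block_tables = {}  # request id -> list of block IDs

    def allocate_for(self, req_id, num_tokens):
        needed = -(-num_tokens // BLOCK_SIZE)  # ceil division
        blocks = self.pool.allocate(needed)
        if blocks is None:
            return False
        self.block_tables[req_id] = blocks
        return True

mgr = ToyKVCacheManager(ToyBlockPool(num_blocks=4))
print(mgr.allocate_for("req-1", num_tokens=40))  # True: needs 3 of 4 blocks
print(mgr.allocate_for("req-2", num_tokens=40))  # False: only 1 block left
```

The real KVCacheManager adds prefix caching, block reuse, and per-layer bookkeeping on top of this basic shape.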
Custom Memory Allocation Logic
To experiment with memory allocation, subclass or modify the block manager. The key methods are allocate, can_allocate, and free.
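As a hypothetical illustration of overriding those methods, the sketch below uses toy classes (`BaseBlockManager`, `HeadroomBlockManager` are not vLLM names) whose `allocate`/`can_allocate`/`free` methods mirror the ones mentioned above. The subclass changes only the admission policy: it reserves headroom, e.g. blocks held back for decode steps of already-running requests.

```python
# Toy classes for illustration; only the method names mirror the ones
# the answer points at (allocate, can_allocate, free).

class BaseBlockManager:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def can_allocate(self, n):
        return len(self.free_blocks) >= n

    def allocate(self, n):
        assert self.can_allocate(n)
        return [self.free_blocks.pop() for _ in range(n)]

    def free(self, blocks):
        self.free_blocks.extend(blocks)

class HeadroomBlockManager(BaseBlockManager):
    """Refuse allocations that would dip into a reserved headroom."""
    def __init__(self, num_blocks, reserved=2):
        super().__init__(num_blocks)
        self.reserved = reserved

    def can_allocate(self, n):
        # Admit a new request only if the reserve stays untouched.
        return len(self.free_blocks) - n >= self.reserved

mgr = HeadroomBlockManager(num_blocks=8, reserved=2)
print(mgr.can_allocate(6))  # True
print(mgr.can_allocate(7))  # False: would eat into the reserve
```

The same pattern applies to the real block manager: keep the bookkeeping in the base class and override only the policy checks.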
For hybrid models or advanced policies, see the hybrid KV cache manager design for how different attention types are handled.
Extension Points
For scheduling: implement a custom scheduler class and set it via the scheduler_cls config (see example).
For memory allocation: subclass the block manager or KV cache manager and override allocation logic.
Both components are modular, so you can experiment without major core changes.
```python
# Example: custom scheduler class
from vllm.v1.core.sched.scheduler import Scheduler

class MyCustomScheduler(Scheduler):
    def schedule(self):
        # Custom logic here
        pass

# Set in config: scheduler_cls=MyCustomScheduler
```
Let me know if you want a code example for a custom block manager or more details on a specific step!