How can I find the top-down architecture of vLLM?

Hi, vLLM folks!
I am new to vLLM and want to get started learning its architecture.

I know some key components of vLLM (e.g., PagedAttention, the scheduler). However, is there any simple and clear top-down architecture figure or illustration of vLLM? In other words, what is the work pipeline of vLLM, and can it be broken down into different levels?

BTW, I have found the vLLM V1 architecture blog post, but I don't find some of its figures clear; they actually confuse me. What I want to know is: as an end-to-end engine, what is the working sequence of vLLM from input to output?

I think an architecture diagram for an LLM inference engine like this would be much friendlier to newcomers!
Thanks!

Yes, vLLM provides top-down architecture diagrams and explanations for newcomers. The Architecture Overview page includes diagrams showing the main entrypoints (the LLM class for offline inference and the OpenAI-compatible API server for online inference) and how these connect to the core engine classes (LLMEngine and AsyncLLMEngine). The diagrams also break the engine down into levels: input processing, scheduling, model execution (possibly distributed), and output processing. They additionally show the class hierarchy from engine to worker to model runner to model.
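
To make the offline path concrete, here is a minimal sketch of the LLM entrypoint, which runs the whole pipeline described above in one process. It assumes a small model such as "facebook/opt-125m" can be downloaded; the prompt and sampling settings are just illustrative.

```python
# Minimal offline-inference sketch using the LLM entrypoint.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Constructing LLM builds the engine (LLMEngine) underneath:
# input processing/tokenization, the scheduler, and the model executor/workers.
llm = LLM(model="facebook/opt-125m")

# generate() runs the end-to-end sequence: prompts are tokenized, the scheduler
# batches requests each step, the workers execute the model with the
# PagedAttention-managed KV cache, and detokenized text is returned.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```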

You can find these diagrams and explanations in the official documentation, which visually and textually describe the vLLM work pipeline and its modular levels, making it easier for newcomers to understand the system’s structure and workflow. Would you like a step-by-step breakdown of these diagrams and their components?
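
For the online path, the sketch below queries a running OpenAI-compatible vLLM server with the standard openai client. It assumes the server was started separately (e.g. with `vllm serve facebook/opt-125m`) and is listening on the default port 8000; the API key is a placeholder.

```python
# Sketch of the online-inference path: query an OpenAI-compatible vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each request enters through the API server and is handed to AsyncLLMEngine,
# where it is scheduled and executed alongside other in-flight requests
# (continuous batching) before the generated text is streamed back.
completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```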
