When you call llm.generate([p1, p2, ..., pn]), vLLM batches the prompts automatically based on available GPU memory. If all the requests cannot fit at once, vLLM schedules them in smaller batches and processes them over successive iterations, so each batch stays within memory constraints. You do not need to split the prompts yourself; vLLM handles this for you, and the results are returned in the same order as the input prompts. This batching logic applies to both the prompt (prefill) and generation (decode) phases and is designed to maximize throughput while respecting memory limits (vLLM API docs, vLLM batching explanation, vLLM batching logic).
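For example, a minimal sketch of a batched call (the model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model supported by vLLM works the same way.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain KV caching in one sentence.",
    "What is continuous batching?",
    "Summarize PagedAttention briefly.",
]

# One call with the full list; vLLM batches and schedules these internally.
outputs = llm.generate(prompts, sampling_params)

# Results come back in the same order as the input prompts.
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output.outputs[0].text)
```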
Would you like more detail on how vLLM schedules and splits batches internally?
Hey! Yes, can you explain how it handles batching internally? What's the objective that vLLM tries to maximise when batching?
If the largest input+output prompt in my batch doesn't exceed memory, can I assume I can never go OOM?
It seems likely that vLLM's objective is to maximize GPU throughput: its continuous batching scheduler packs as many requests as possible into each iteration within available memory. Rather than splitting the input list into fixed batches up front, the scheduler admits waiting requests into the running batch whenever there is room and frees slots as soon as sequences finish, so the GPU stays as busy as possible. To avoid OOM, it bounds each iteration by both the number of sequences (max_num_seqs) and the total number of tokens (max_num_batched_tokens) (vLLM batching logic, batch size control).
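Both limits can be set when constructing the engine; a minimal sketch (the values are illustrative, not recommendations, and good settings depend on your GPU and model):

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    max_num_seqs=64,                  # cap on sequences scheduled per iteration
    max_num_batched_tokens=4096,      # cap on total tokens processed per iteration
)
```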
However, even if the largest single input+output fits in memory on its own, you can still go OOM if the combined memory use of all concurrent requests in a batch exceeds available GPU memory, for example because many long sequences grow their KV caches at the same time. vLLM limits batch size to prevent this, but features that add memory pressure, such as returning many logprobs, or edge cases around quantization, can still trigger OOM (OOM discussion, batching and OOM). So while vLLM greatly reduces the risk, it cannot guarantee OOM will never happen in all scenarios.
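If you do run into OOM, a few engine arguments bound the memory vLLM will claim; a sketch with illustrative values:

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",     # placeholder model
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may reserve (default 0.9)
    max_model_len=2048,            # caps per-sequence context (prompt + generation)
    swap_space=4,                  # CPU swap space in GiB for preempted sequences
)
```

Lowering max_model_len and max_num_seqs shrinks the worst-case KV-cache footprint, while gpu_memory_utilization leaves headroom for activations and other overhead.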
Would you like a step-by-step breakdown of the scheduler’s batching and memory management?