If we have 1000 queries in query_list and run on a single GPU, when we call LLM.generate(query_list), how does the engine process these 1000 requests?
When you call `LLM.generate(query_list)` with 1000 prompts, vLLM automatically batches the input sequences based on available GPU memory. The engine maintains a queue and processes as many requests as fit in memory at once, then continues with the next batch until all prompts are completed. This maximizes throughput and GPU utilization; you do not need to manually split the queries. Just pass the full list to `generate`, and vLLM handles batching and scheduling internally. For best performance, provide all prompts at once in a single call to `generate`, as recommended in the documentation and issues[^1][^2][^3].

If the number of prompts exceeds what can be processed in a single batch, vLLM will automatically process them in multiple batches, returning results in the same order as the input list[^4]. The batching is dynamic and adapts to the memory constraints of your hardware.
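As a rough illustration of that behavior, here is a pure-Python toy, not the actual vLLM engine: the `token_budget` knob and the `generate` helper below are invented for this sketch, but they model the two properties described above, i.e. requests are drained from a queue in memory-sized batches, and results come back in input order.

```python
from collections import deque

def generate(query_list, token_budget=64):
    """Toy sketch of vLLM-style batched processing (not the real engine):
    requests are pulled from a queue into batches that fit a token budget,
    and results are returned in the original input order."""
    queue = deque(enumerate(query_list))          # (input_index, prompt)
    results = {}
    while queue:
        batch, used = [], 0
        # Fill the batch until the hypothetical token budget is exhausted.
        while queue and used + len(queue[0][1]) <= token_budget:
            idx, prompt = queue.popleft()
            batch.append((idx, prompt))
            used += len(prompt)
        # "Process" the batch (a real engine would run the model here).
        for idx, prompt in batch:
            results[idx] = prompt.upper()
    return [results[i] for i in range(len(query_list))]

outputs = generate(["hello world", "vllm", "batching demo"], token_budget=16)
print(outputs)  # order matches the input list, even across multiple batches
```

With a budget of 16 "tokens" the first two prompts share one batch and the third goes into a second batch, yet the caller never sees the split.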
Would you like more detail on the internal batching or scheduling process?
What are the details of how vLLM automatically organizes batches? How does it work?
vLLM uses a dynamic, continuous batching scheduler. When you submit many requests (e.g., via `LLM.generate(query_list)`), vLLM maintains an internal queue and automatically groups as many requests as possible into a batch, constrained by GPU memory and parameters like `max_num_seqs` (max batch size) and `max_num_batched_tokens` (max tokens per batch). The scheduler fills each batch with pending requests, processes them together, and as soon as space is available (e.g., when a request finishes), it immediately adds new requests to the next batch; this is called continuous batching[^1][^2][^3][^4].
Batch composition is determined by the scheduler at each iteration, considering the current queue, prompt lengths, and memory limits. The scheduler does not wait for all requests in a batch to finish before starting new ones; instead, it dynamically refills the batch to maximize GPU utilization and minimize latency. This approach is used in both the prefill and decode phases[^5][^6].
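The slot-refill idea can be modeled with a small simulation (again a deliberately simplified toy, not vLLM internals; `decode_lengths` and the step counting are invented for the illustration). Each request needs a fixed number of decode steps, the running batch holds at most `max_num_seqs` requests, and a finished request's slot is refilled on the very next step rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(decode_lengths, max_num_seqs=2):
    """Toy model of continuous batching: returns the step at which each
    request finished. Freed slots are refilled immediately."""
    waiting = deque(enumerate(decode_lengths))
    running = {}                       # request id -> remaining decode steps
    finish_step = {}
    step = 0
    while waiting or running:
        # Refill free slots immediately instead of waiting for the batch to drain.
        while waiting and len(running) < max_num_seqs:
            rid, need = waiting.popleft()
            running[rid] = need
        step += 1
        for rid in list(running):      # one decode step for every active request
            running[rid] -= 1
            if running[rid] == 0:
                finish_step[rid] = step
                del running[rid]
    return finish_step

# Request 1 finishes at step 1, so request 2 starts at step 2 instead of
# waiting for request 0; everything is done after 3 steps.
print(continuous_batching([3, 1, 2], max_num_seqs=2))
```

Static batching on the same workload would take 5 steps (3 for the batch `[0, 1]`, then 2 for `[2]`); the continuous variant finishes in 3 because the freed slot is reused at once.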
Would you like a step-by-step breakdown of the scheduling and batching process?
Sources:
- GitHub Issue: Does the continuous batching technology in the vLLM online service scenario contain the concept of batch size?
- GitHub Issue: Specify the max batch size with continous batching
- GitHub Issue: How many requests can llm.generate handle in parallel?
- Optimization and Tuning
- GitHub Issue: vLLM is 4x faster than HF for offline inference
Yes, I would like to know in detail how this dynamic batch is organized.
Certainly! Here’s a step-by-step breakdown of how vLLM dynamically organizes batches:

- Request Queueing: When you call `LLM.generate(query_list)`, all prompts are placed into an internal queue. Each request is tracked individually, with its own prompt, sampling parameters, and state (e.g., prefill or decode phase) (source, source).
- Batch Formation: The scheduler examines the queue and selects as many requests as possible to form a batch, constrained by `max_num_seqs` (maximum number of sequences per batch) and `max_num_batched_tokens` (maximum total tokens per batch). The scheduler considers prompt lengths and available GPU memory to avoid exceeding resource limits (source, source, source).
- Continuous Batching: Unlike traditional static batching, vLLM does not wait for all requests in a batch to finish. As soon as a request completes (e.g., finishes generating all its tokens), the scheduler immediately fills the vacated slot with a new request from the queue. This keeps the batch as full as possible, maximizing GPU utilization (source, source).
- Prefill and Decode Phases: In the prefill phase, the model processes the initial prompt tokens for each request in the batch. In the decode phase, the model generates one token at a time for each active request. Both phases use the same dynamic batching mechanism, so new requests can be added as soon as there is capacity (source).
- Batch Scheduling Iterations: At each iteration, the scheduler re-evaluates the queue and the current batch, selecting the optimal set of requests to process next. This may involve mixing requests in different phases (prefill and decode) and chunking long prompts if needed (source, source).
- Result Ordering: vLLM keeps track of the order of input requests and returns results in the same order, even though requests may finish at different times due to dynamic batching (source).
This dynamic, continuous batching approach is what enables vLLM to achieve high throughput and low latency, especially under high-concurrency workloads.
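The steps above can be tied together in one toy scheduler loop (a simplified model with made-up costs; the `schedule` helper, the per-iteration admission rule, and treating prefill as a single iteration are illustrative assumptions, not vLLM source code). Each iteration admits waiting requests under `max_num_seqs` and token-budget limits, runs prefill for newly admitted requests and one decode step for active ones, and reports completion times in input order:

```python
from collections import deque

def schedule(requests, max_num_seqs=2, max_num_batched_tokens=8):
    """Toy scheduler: requests are (prompt_len, decode_len) pairs.
    Returns (finish iteration per request in input order, total iterations)."""
    waiting = deque(enumerate(requests))
    running = []                           # [rid, phase, remaining, decode_len]
    done = {}
    iterations = 0
    while waiting or running:
        iterations += 1
        # 1) Admit waiting requests while slots and token budget allow
        #    (each already-running decode request consumes 1 token this step).
        budget = max_num_batched_tokens - len(running)
        while waiting and len(running) < max_num_seqs:
            rid, (p_len, d_len) = waiting[0]
            if p_len > budget:
                break                      # prompt would not fit this iteration
            waiting.popleft()
            running.append([rid, "prefill", p_len, d_len])
            budget -= p_len
        # 2) Run one step: prefill consumes the prompt, decode emits one token.
        still = []
        for rid, phase, rem, d_len in running:
            if phase == "prefill":
                still.append([rid, "decode", d_len, d_len])
            elif rem > 1:
                still.append([rid, "decode", rem - 1, d_len])
            else:
                done[rid] = iterations     # last token generated
        running = still
    return [done[i] for i in range(len(requests))], iterations

finish, iters = schedule([(4, 2), (3, 1), (2, 2)])
print(finish, iters)  # finish iteration per request, in input order
```

Note how request 2 is admitted mid-run (step 1 and 3 of the list above), how prefill and decode requests coexist in the same iteration (step 4 and 5), and how the return value preserves input order regardless of finish time (step 6).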