Suggestion to improve inference speed

vLLM uses a dynamic batching scheduler that does not wait for a batch to fill. At each scheduling step it processes whatever requests are available, up to the limits set by max-num-batched-tokens (total tokens per step) and max-num-seqs (concurrent sequences per step). This keeps latency low for real-time workloads, because new requests are picked up as soon as resources allow (Optimization and Tuning).
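
To make the two limits concrete, here is a toy sketch of one scheduling step. It is not vLLM's actual scheduler: the Request class and schedule_step function are hypothetical names made up for illustration, and the sketch ignores details such as chunked prefill and KV-cache availability. It only shows the idea of admitting whatever is waiting, in arrival order, until either limit is reached.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, not a vLLM class.
    request_id: str
    num_tokens: int  # tokens this request needs scheduled in the step


def schedule_step(waiting, max_num_seqs, max_num_batched_tokens):
    """Admit waiting requests in arrival order until a limit is hit.

    The step never waits for the batch to fill: it returns whatever fits now.
    """
    batch = []
    token_budget = max_num_batched_tokens
    while waiting and len(batch) < max_num_seqs:
        head = waiting[0]
        if head.num_tokens > token_budget:
            break  # head of the queue does not fit; leave it for the next step
        batch.append(waiting.popleft())
        token_budget -= head.num_tokens
    return batch


waiting = deque([Request("a", 300), Request("b", 500), Request("c", 400)])
first = schedule_step(waiting, max_num_seqs=4, max_num_batched_tokens=1000)
second = schedule_step(waiting, max_num_seqs=4, max_num_batched_tokens=1000)
print([r.request_id for r in first])
print([r.request_id for r in second])
```

In this example the first step takes requests a and b (800 tokens) and defers c, which would exceed the 1000-token budget; c runs in the next step. Raising either limit lets more work into each step, which tends to help throughput, while lower limits tend to favor per-request latency.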

The default scheduling policy is First-Come, First-Served (FCFS), meaning requests are handled in the order they arrive. You can change this to a priority-based policy if needed, but FCFS is the standard for most deployments (SchedulerConfig).
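
The difference between the two policies can be sketched with a small ordering example. This is an illustration only, not vLLM code: fcfs_order and priority_order are hypothetical helpers, and the convention that a lower priority value is served first, with ties broken by arrival order, is an assumption of this sketch that should be checked against the SchedulerConfig documentation for your vLLM version.

```python
import heapq
from itertools import count


def fcfs_order(requests):
    """requests: list of (request_id, priority) tuples in arrival order."""
    return [request_id for request_id, _ in requests]


def priority_order(requests):
    """Assumed convention: lower priority value first, ties by arrival order."""
    arrival = count()
    heap = [(priority, next(arrival), request_id) for request_id, priority in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


requests = [("r1", 2), ("r2", 0), ("r3", 2), ("r4", 1)]
print(fcfs_order(requests))
print(priority_order(requests))
```

Under FCFS the requests run in the order they arrived. Under the priority ordering sketched here, r2 and r4 jump ahead of r1 and r3, which keep their relative arrival order because they share the same priority value.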
