I’m experimenting with multi-LoRA serving in vLLM and I’d like clarification on how batching works in practice.
Setup:
Started vLLM with --enable-lora --max-loras 2.
Loaded two LoRA adapters (say adapter-A and adapter-B).
A request using adapter-A is already being decoded. While that’s running, a new request using adapter-B arrives.
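For concreteness, a launch command for this setup might look like the following (the base model name and adapter paths are placeholders, not from the original setup):

```shell
# Sketch: serve a base model with two LoRA adapters registered at startup.
# Model name and adapter paths are illustrative placeholders.
vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --max-loras 2 \
    --lora-modules adapter-A=/path/to/adapter-A adapter-B=/path/to/adapter-B
```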
Questions:
Will vLLM merge and decode both requests concurrently in the same batch, even though they use different LoRA adapters? And what about the prefill?
If yes, what is the expected overhead compared to batching requests that all use the same adapter (e.g., kernel fusion, throughput impact)? And how would the expected latency compare to running two parallel vLLM instances, each serving one adapter (assuming no memory constraints in either case)?
I’ve seen the multi-LoRA offline example in the docs, but I couldn’t find a clear explanation of whether online continuous batching supports heterogeneous LoRA adapters.
This would be very useful to understand for multi-tenant workloads, where different users might each use a different LoRA adapter but still benefit from batching efficiency.
Thanks a lot for the amazing project and for clarifying this!
Yes, vLLM supports batching requests that use different LoRA adapters in the same batch, in both the prefill and decode phases, as long as --max-loras is set high enough to accommodate the number of unique adapters in the batch. This is explicitly supported and is a key feature for multi-tenant workloads: requests using different adapters are merged and processed concurrently in a single batch, leveraging efficient kernel implementations (e.g., Punica batched GEMM kernels) to minimize overhead compared to single-adapter batching.

There is still some performance cost: throughput and latency are slightly worse than batching requests that all use the same adapter, but the impact is generally modest and much smaller than running separate vLLM instances per adapter, especially when memory is not a bottleneck. The overhead comes mainly from the extra computation and memory operations needed to handle multiple adapters, and vLLM is optimized for this use case (docs, example, discussion).
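As a client-side sketch: assuming the server was started with --lora-modules registering adapter-A and adapter-B, concurrent requests simply name the adapter in the `model` field, and the continuous-batching scheduler can place them in the same batch (endpoint and prompts are illustrative):

```shell
# Two concurrent requests against the OpenAI-compatible server,
# each targeting a different registered LoRA adapter by name.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "adapter-A", "prompt": "Hello from tenant A", "max_tokens": 32}' &
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "adapter-B", "prompt": "Hello from tenant B", "max_tokens": 32}' &
wait
```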
In summary, online continuous batching in vLLM does support heterogeneous LoRA adapters, and you can expect better batching efficiency and resource utilization than running separate instances, with only a small throughput/latency penalty compared to homogeneous batches. For more details on performance impact and kernel fusion, see the multi-LoRA example and related discussions. Would you like a step-by-step breakdown or code example?
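A minimal offline sketch of heterogeneous-adapter batching (base model name and adapter paths are placeholders; this assumes a GPU, locally available adapter weights, and a vLLM version whose `generate` accepts a list of `LoRARequest`s):

```python
# Sketch: batch prompts that use different LoRA adapters in one generate() call.
# Model name and adapter paths are illustrative placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=2)
params = SamplingParams(max_tokens=32)

# One LoRARequest per prompt; the scheduler can batch both requests
# together even though they target different adapters.
outputs = llm.generate(
    ["Prompt for tenant A", "Prompt for tenant B"],
    params,
    lora_request=[
        LoRARequest("adapter-A", 1, "/path/to/adapter-A"),
        LoRARequest("adapter-B", 2, "/path/to/adapter-B"),
    ],
)
```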