When I add a LoRA adapter to the model with the add_lora interface and include a lora_request in generate, the generation becomes very slow. Is this expected?
Yes, some slowdown is expected when using a LoRA adapter in vLLM, especially for long-context inputs or when certain parameters are left at their defaults. The slowdown comes from the extra overhead of the LoRA kernel operations, and if the sequence length exceeds the default max-seq-len-to-capture (8192), vLLM falls back to eager mode, which is much slower. Setting --max-seq-len-to-capture to match your maximum sequence length can significantly improve performance. For example, users have reported dramatic speed drops (e.g., from 27 tokens/s to 5 tokens/s) when enabling LoRA that were resolved by adjusting this parameter. There is also a general overhead (about 10%) even when LoRA is enabled but not used, due to the CUDA-graph requirements of the LoRA kernels. See this discussion and this performance report.
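For example, with the offline LLM API the capture limit can be raised at engine construction time. This is a minimal sketch, assuming the max_seq_len_to_capture keyword mirrors the --max-seq-len-to-capture CLI flag; the model name and the 32768 limit are placeholders to adjust to your workload:

from vllm import LLM

# Raise the CUDA-graph capture limit above the default 8192 so that
# long-context LoRA requests do not fall back to eager mode.
# (Sketch: model name and limit are placeholder values.)
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_seq_len_to_capture=32768,
)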
Would you like more detail on how to optimize LoRA performance in vLLM?
Could you please provide your script?
The script referenced in the performance discussions is the standard vLLM generation script with LoRA enabled. Here is a minimal example based on the official documentation and user reports:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support when constructing the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

sql_lora_path = "/path/to/your/lora/adapter"

prompts = [
    "Your prompt here.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Activate the adapter per request by passing a LoRARequest.
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)

for output in outputs:
    print(output.outputs[0].text)
For best performance with long contexts, start the server with --max-seq-len-to-capture set to your maximum sequence length, e.g.:
vllm serve meta-llama/Llama-2-7b-hf --enable-lora --max-seq-len-to-capture 48000
See LoRA documentation and performance issue discussion for more details.
Would you like a more advanced example or async usage?
I am not using the vLLM scripts directly.
Following the RLHF example, I send LoRA weight updates to the vLLM engine inside my GRPOTrainer and activate the adapter at generation time by passing a lora_request.
What I observe is that when I submit multiple requests (e.g. 4 or 8), the first request is extremely slow, for example:
25%|██▌ | 1/4 [01:48<05:25, 108.57s/it]
100%|██████████| 4/4 [01:53<00:00, 28.36s/it]
I’m not sure whether this is caused by my implementation; I’m currently profiling to locate the bottleneck.
My implementation: https://github.com/modelscope/ms-swift/pull/5773
The slow first request when dynamically loading or updating a LoRA adapter in vLLM is expected. The initial request triggers loading and optimization of the LoRA weights, which can be slow due to CPU contention, memory pinning, and CUDA graph capture. Subsequent requests are much faster because the adapter is already loaded and optimized. This behavior is documented in multiple vLLM issues and is not specific to your implementation; see vLLM issue #5564 and vLLM issue #6072.
To mitigate this, you can “warm up” the LoRA adapter by running a dummy generation immediately after loading or updating the adapter, so that the first real request does not pay the full loading/optimization cost.
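A warmup can be as simple as one short dummy generation issued with the same lora_request right after the adapter is loaded or updated. This is a minimal sketch reusing the setup from the earlier example; the adapter name, ID, and path are placeholders:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
lora_request = LoRARequest("my_adapter", 1, "/path/to/your/lora/adapter")

# Dummy generation: pays the one-time adapter loading/optimization cost
# up front so the first real request runs at normal speed.
llm.generate(["warmup"], SamplingParams(max_tokens=1), lora_request=lora_request)

# Real requests issued afterwards reuse the already-loaded adapter.

In a trainer loop, the same dummy call would go right after each LoRA weight update, before the real prompts are submitted.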
Would you like more detail on profiling or warmup strategies?