When I add a LoRA adapter to the model with the add_lora interface and include a lora_request in generate, the generation becomes very slow. Is this expected?
Yes, some slowdown is expected when using a LoRA adapter in vLLM, especially for long-context inputs or when certain parameters are left at their defaults. The slowdown comes from the extra overhead of the LoRA kernel operations, and if the sequence length exceeds the default max-seq-len-to-capture (8192), vLLM falls back to eager mode, which is much slower. Setting --max-seq-len-to-capture to match your maximum sequence length can significantly improve performance. For example, users have reported dramatic speed drops (e.g., from 27 tokens/s to 5 tokens/s) when enabling LoRA that were resolved by adjusting this parameter. There is also a general overhead (about 10%) even when LoRA is enabled but not used, due to the CUDA-graph requirements of the LoRA kernels. See this discussion and this performance report.
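For example, with the offline LLM API the capture limit can be raised at engine construction time. This is a minimal sketch, assuming the max_seq_len_to_capture keyword mirrors the --max-seq-len-to-capture CLI flag; the model name and the 32768 limit are placeholders to adjust to your workload:

from vllm import LLM

# Raise the CUDA-graph capture limit above the default 8192 so that
# long-context LoRA requests do not fall back to eager mode.
# (Sketch: model name and limit are placeholder values.)
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_seq_len_to_capture=32768,
)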
Would you like more detail on how to optimize LoRA performance in vLLM?
Could you please provide your script?
The script referenced in the performance discussions is the standard vLLM generation script with LoRA enabled. Here is a minimal example based on the official documentation and user reports:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support when constructing the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

sql_lora_path = "/path/to/your/lora/adapter"

prompts = [
    "Your prompt here.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Activate the adapter per request by passing a LoRARequest.
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)

for output in outputs:
    print(output.outputs[0].text)
For best performance with long contexts, start the server with --max-seq-len-to-capture set to your maximum sequence length, e.g.:
vllm serve meta-llama/Llama-2-7b-hf --enable-lora --max-seq-len-to-capture 48000
See LoRA documentation and performance issue discussion for more details.
Would you like a more advanced example or async usage?
I am not using the vLLM scripts directly.
Following the RLHF example, I send LoRA weight updates to the vLLM engine inside my GRPOTrainer and activate the adapter at generation time by passing a lora_request.
What I observe is that when I submit multiple requests (e.g. 4 or 8), the first request is extremely slow, for example:
25%|██▌ | 1/4 [01:48<05:25, 108.57s/it]
100%|██████████| 4/4 [01:53<00:00, 28.36s/it]
I’m not sure whether this is caused by my implementation; I’m currently profiling to locate the bottleneck.
My implementation: https://github.com/modelscope/ms-swift/pull/5773
The slow first request when dynamically loading or updating a LoRA adapter in vLLM is expected. The initial request triggers loading and optimization of the LoRA weights, which can be slow due to CPU contention, memory pinning, and CUDA graph capture. Subsequent requests are much faster because the adapter is already loaded and optimized. This behavior is documented in multiple vLLM issues and is not specific to your implementation; see vLLM issue #5564 and vLLM issue #6072.
To mitigate this, you can “warm up” the LoRA adapter by running a dummy generation immediately after loading or updating the adapter, so that the first real request does not pay the full loading/optimization cost.
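A warmup can be as simple as one short dummy generation issued with the same lora_request right after the adapter is loaded or updated. This is a minimal sketch reusing the setup from the earlier example; the adapter name, ID, and path are placeholders:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
lora_request = LoRARequest("my_adapter", 1, "/path/to/your/lora/adapter")

# Dummy generation: pays the one-time adapter loading/optimization cost
# up front so the first real request runs at normal speed.
llm.generate(["warmup"], SamplingParams(max_tokens=1), lora_request=lora_request)

# Real requests issued afterwards reuse the already-loaded adapter.

In a trainer loop, the same dummy call would go right after each LoRA weight update, before the real prompts are submitted.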
Would you like more detail on profiling or warmup strategies?