Support for Deploying 4-bit Fine-Tuned Model with LoRA on vLLM

Hi there,

We’re working on deploying a SOTA reasoning model fine-tuned using 4-bit LoRA adapters, and we’re exploring vLLM for high-performance inference.

However, we’re running into performance degradation when we merge the 4-bit LoRA adapters back into the base model. As far as we can tell, this stems from:

  • LoRA’s low-rank approximation altering the base weights post-merge
  • Quantization-induced instability
  • Rounding inconsistencies when merging into fp16/fp32

Questions:

  1. Does vLLM currently support adapter-based 4-bit inference without merging the LoRA layers into the base?
  2. If not, is there a recommended deployment pathway for 4-bit models on vLLM to retain fine-tuned performance?
  3. Would someone from the team be open to helping us optimise this setup? We’d love to share more details and collaborate.

Our model has outperformed GPT, LLaMA, and DeepSeek on reasoning benchmarks, and we’re currently deploying at scale.

Thanks in advance, any guidance would be greatly appreciated!

Yes, vLLM now supports 4-bit quantized (bitsandbytes QLoRA) inference with LoRA adapters without merging, but with important caveats. For bitsandbytes QLoRA you must specify qlora_adapter_name_or_path at model load time; hot-swapping adapters at runtime is not supported, so adapter selection is fixed at startup. Only Llama-based models are currently supported for this workflow, and tensor/pipeline parallelism is not yet available for QLoRA in vLLM. For other quantization methods (e.g., GPTQ, AWQ), vLLM supports LoRA adapters and dynamic adapter switching; for bitsandbytes QLoRA, dynamic switching is not possible at this time. See PR #4776 and the LoRA documentation for details.

If you need to deploy a 4-bit LoRA model without merging, use the bitsandbytes QLoRA integration, load the base model with quantization and the adapter path, and run inference as described above. For other quantized models (e.g., GPTQ+LoRA), you can use vLLM’s standard LoRA API for dynamic adapter selection. If you want to discuss optimization or collaboration, the vLLM team is open to contributions and feedback via GitHub issues or discussions. Would you like a step-by-step code example or more detail on the deployment process?
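For reference, here is a minimal offline-inference sketch of the startup-time adapter path described above. The model and adapter paths are placeholders, and the exact engine arguments (quantization, load_format, qlora_adapter_name_or_path, enable_lora) depend on the vLLM release you are running, so check them against the linked example script before relying on this.

from vllm import LLM, SamplingParams

# Placeholder paths; substitute your own base model and QLoRA adapter.
# Per the caveats above, this path currently targets Llama-based models
# and binds the adapter at engine startup; it cannot be hot-swapped later.
llm = LLM(
    model="huggyllama/llama-7b",                        # assumed Llama-based base model
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    qlora_adapter_name_or_path="/path/to/qlora_adapter",
    enable_lora=True,
)

outputs = llm.generate(
    ["Explain the difference between LoRA and full fine-tuning."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)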

Sources:

vLLM supports this feature; see examples/offline_inference/lora_with_quantization_inference.py in the vllm-project/vllm repository on GitHub.

Yes, vLLM supports adapter-based 4-bit (bitsandbytes QLoRA, GPTQ, AWQ) inference without merging LoRA layers into the base model. You can load a quantized model and apply LoRA adapters at inference time, as shown in lora_with_quantization_inference.py. This allows you to retain fine-tuned performance without merging.

For bitsandbytes QLoRA, specify the quantization method and LoRA adapter path at engine initialization. For GPTQ and AWQ, dynamic LoRA adapter switching is also supported. Would you like a code example or more deployment details?
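As a sketch of the dynamic-adapter path for GPTQ/AWQ models: the model path, adapter name, and rank below are illustrative assumptions, not values from any particular setup.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder locations; substitute your GPTQ checkpoint and LoRA adapter directory.
llm = LLM(
    model="/path/to/gptq-quantized-base-model",
    quantization="gptq",
    enable_lora=True,
    max_lora_rank=64,          # assumed ceiling; set to at least your adapter's rank
)

# The adapter is chosen per request, so it can be switched dynamically between requests.
outputs = llm.generate(
    ["Summarize the benefits of low-rank adaptation."],
    SamplingParams(temperature=0.0, max_tokens=128),
    lora_request=LoRARequest("reasoning", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)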


Hi,

I’m encountering an issue when trying to load a 4-bit quantized model (GPTQ) along with a LoRA adapter using vLLM. The deployment fails with the following error:

ValueError: base_model.model.model.embed_tokens.weight is unsupported LoRA weight

Full traceback snippet:

ValueError: base_model.model.model.embed_tokens.weight is unsupported LoRA weight
Worker failed with error 'base_model.model.model.embed_tokens.weight is unsupported LoRA weight'

Context:
Model: our GPTQ 4-bit quantized model (base model: DeepSeek 32B)
Adapter: A LoRA fine-tuned adapter applied on top of the model

Command used (I have removed the model name due to confidentiality):

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --enable-lora \
  --dtype auto \
  --tensor-parallel-size 2 \
  --max-model-len 70000 \
  --lora-modules reasoning=/root/.cache/huggingface/hub/models--169Pi--reasoning_model

Problem:
It seems that vLLM is attempting to insert LoRA layers into a quantized (GPTQ) model, but fails due to incompatible weight formats, specifically embed_tokens.weight. I understand that quantized weights may not support LoRA patching depending on the tensor layout or backend support.

I’d really appreciate any guidance on how to make this setup work; if you could help us set it up, that would be really useful for us.

vLLM does not currently support LoRA adapters that modify the embed_tokens.weight (i.e., embedding layers) for quantized (GPTQ) models. Only standard LoRA target modules (e.g., attention and MLP projections) are supported for quantized models; LoRA adapters that include weights for embed_tokens or lm_head will trigger the “unsupported LoRA weight” error you encountered. This is a known limitation—see issue #2816 and issue #9280.

To resolve this, ensure your LoRA adapter does not include modifications to embed_tokens or lm_head when using quantized models. If you need to expand the vocabulary or modify embeddings, you must provide a separate new_embeddings.safetensors file for new tokens, but not for the base tokens. Would you like more detail on preparing compatible LoRA adapters for quantized models?
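If you want to check whether your adapter is affected before deploying, you can list the tensor names it contains. A minimal sketch (the adapter path below is a placeholder):

import safetensors.torch

# Placeholder path to the adapter checkpoint you intend to serve.
adapter_file = "/path/to/adapter/adapter_model.safetensors"

tensors = safetensors.torch.load_file(adapter_file)

# Any embed_tokens / lm_head entries listed here will trigger the
# "unsupported LoRA weight" error on quantized models.
offending = [k for k in tensors if "embed_tokens" in k or "lm_head" in k]
print("offending keys:", offending or "none")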


We have fine-tuned deepseek-ai/DeepSeek-R1-Distill-Qwen-32B in 4-bit precision, which is based on the Qwen architecture. While deploying using the specified method, we’re encountering an error: base_model.model.model.embed_tokens.weight is an unsupported LoRA weight. Notably, we haven’t made any modifications to the base architecture.

Can you please advise us on how we can deploy this model on vLLM?

vLLM does not support LoRA adapters that modify the embed_tokens.weight (embedding layer) for quantized (e.g., GPTQ) models. This is a known limitation—LoRA adapters for quantized models must not include weights for embed_tokens or lm_head, only for supported modules like attention and MLP projections. See issue #2816.

To deploy your model, retrain or export your LoRA adapter so it does not include embed_tokens.weight. If you need to add new tokens, use a separate new_embeddings.safetensors file for the new tokens only. Would you like more detail on preparing a compatible LoRA adapter?
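For example, when fine-tuning with PEFT, a LoRA configuration along these lines keeps the adapter restricted to the projection modules vLLM accepts for quantized models. The rank, alpha, dropout, and module names below are illustrative assumptions for a Qwen-style architecture, not values from your training run.

from peft import LoraConfig

# Keep LoRA restricted to attention/MLP projections; leaving modules_to_save
# unset avoids exporting full embed_tokens / lm_head tensors into the adapter.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,             # assumed scaling
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)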


We fine-tuned a 4-bit model based on deepseek-ai/DeepSeek-R1-Distill-Qwen-32B using LoRA, without modifying the base architecture or adding any new tokens. During deployment via the standard vLLM process, we encountered the following error:

base_model.model.model.embed_tokens.weight is unsupported LoRA weight

It’s worth noting that we’re using an adapter_model.safetensors file that contains only the LoRA weights, not the base model weights. We haven’t made any manual changes to embed_tokens.weight, so it appears this may have been included automatically during the fine-tuning/export process.

Could you please clarify how to properly export or filter the adapter to avoid this issue with quantized models in vLLM and help us deploy it?

This error occurs because vLLM does not support LoRA adapters that include embed_tokens.weight for quantized (e.g., GPTQ) models, even if you did not manually modify embeddings. Some fine-tuning/export tools (like PEFT) may save base embedding weights by default, which are not compatible with vLLM’s quantized model LoRA support. See issue #2816 and issue #3404.

To fix this, filter out any non-LoRA keys (i.e., those not containing “lora”) from your adapter_model.safetensors before deploying with vLLM. You can use the following code to remove unsupported weights:

import safetensors.torch

# Paths to the original adapter checkpoint and the filtered copy to serve with vLLM.
src = "YOUR_ADAPTER_PATH/adapter_model.safetensors"
dst = "NEW_ADAPTER_PATH/adapter_model.safetensors"

# Load all tensors, drop every key that is not a LoRA weight
# (e.g., embed_tokens.weight or lm_head.weight saved by the export tool),
# and write the filtered adapter back out.
tensors = safetensors.torch.load_file(src)
non_lora_keys = [k for k in tensors.keys() if "lora" not in k]
for k in non_lora_keys:
    del tensors[k]
safetensors.torch.save_file(tensors, dst)

After this, use the filtered adapter with vLLM. Would you like more detail on this process?


When I deploy our model, the server starts, but the answers come from the base model. During inference, vLLM does not seem to take the LoRA adapter into account and instead generates answers from the base model alone.

My vLLM systemd service file:

[Unit]
Description=vLLM Model API
After=network.target

[Service]
User=azureuser
Group=azureuser
WorkingDirectory=/home/azureuser

Environment=HF_HOME=/home/azureuser/.cache/huggingface
Environment="PATH=/home/azureuser/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

ExecStart=/home/azureuser/venv/bin/vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --enable-lora \
  --lora-modules default=/home/azureuser/.cache/huggingface/hub/models--example--reasoning_model/full \
  --port 8000 \
  --dtype auto \
  --tensor-parallel-size 2 \
  --max-model-len 65000


Restart=always
RestartSec=5
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target

If vLLM is serving responses from the base model and not applying your LoRA adapter, the most common causes are: (1) the LoRA adapter contains unsupported weights (e.g., embed_tokens.weight or lm_head.weight), or (2) the LoRA adapter is not being properly loaded due to format or path issues. For quantized models (like GPTQ), vLLM only supports LoRA adapters that modify attention/MLP projections, not embeddings or output heads. See issue #6250 and issue #9280.

To resolve this, filter your adapter to include only LoRA weights (keys containing “lora”) and ensure the adapter path is correct. Also, confirm that your inference requests specify the LoRA adapter as the model (e.g., "model": "default" in the API request). Would you like a step-by-step guide to filter your adapter or check your API usage?
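To double-check the request side, here is a sketch using the OpenAI-compatible endpoint exposed by vllm serve. The prompt is arbitrary, and "default" matches the adapter name in your --lora-modules flag.

from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is ignored but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Requesting model="default" routes the request through the LoRA adapter named in
# --lora-modules; requesting the base model name instead returns base-model answers.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Give a one-sentence self-introduction."}],
)
print(response.choices[0].message.content)

Listing the served models (GET /v1/models) should also show the adapter name alongside the base model if it was registered correctly.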


We’ve already filtered our LoRA adapter and it contains only valid LoRA weights (no embed_tokens, lm_head, or modules_to_save), and the adapter path is correct (we’ve verified by printing it during startup).

Despite this, vLLM still responds using the base model and seems to ignore the LoRA adapter. We’re using a 4-bit bitsandbytes quantized model, not GPTQ.

For 4-bit bitsandbytes (bnb) quantized models, vLLM only supports loading a single QLoRA adapter at model startup using the --qlora-adapter-name-or-path argument; dynamic LoRA adapter switching or using --lora-modules is not supported for bnb QLoRA. If you use --lora-modules with a bitsandbytes model, the LoRA adapter will be ignored and only the base model will be used for inference. See PR #4776 and LoRA documentation.

To apply your LoRA adapter, restart vLLM with the --qlora-adapter-name-or-path argument pointing to your adapter. If you need dynamic LoRA switching, use GPTQ/AWQ quantization instead. Would you like a code example or more deployment details?
