How is LoRA (bf16) fused into a GPTQ 8-bit model?

Hi guys, I want to ask how LoRA (bf16) is applied to the base LLM model. When using a GPTQ 8-bit quantized model with LoRA adapters enabled, could you please clarify the exact memory flow and data types involved in the fused computation during inference?
As I understand:

  • GPTQ stores the weights in an 8-bit format in global memory (VRAM). During inference, are these INT8 weights dequantized on the fly inside the CUDA kernel (i.e., the kernel fetches the 8-bit weights)?
  • With LoRA, are the LoRA weights (bf16) fetched into the kernel and then fused with the dequantized layer inside the same CUDA kernel?

I really want to know exactly how the data is fetched and processed.

It seems likely that, when using a GPTQ 8-bit quantized model with LoRA adapters in vLLM, the base model weights are stored as INT8 (8 bits) in GPU memory. During inference, these INT8 weights are dequantized on the fly inside the CUDA kernel: the kernel fetches the 8-bit weights, applies the quantization scales, and converts them to FP16/BF16 for computation. For LoRA, the adapter weights are typically stored in BF16 or FP16 and are also fetched into the kernel. The fused computation happens inside the CUDA kernel: the dequantized base weights and the LoRA weights are combined (fused) during the matrix multiplication, so the output reflects both the quantized base and the LoRA adaptation in a single pass. This approach minimizes memory movement and maximizes efficiency by keeping all operations on the GPU and fusing them at the kernel level.

The data flow is: fetch INT8 base weights + BF16 LoRA weights → dequantize base weights → fuse with LoRA weights → compute output in FP16/BF16.
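For illustration, here is a rough PyTorch-style sketch of the fused flow described above. It is purely conceptual: names like `qweight`, `scales`, `lora_a`, and `lora_b` are assumptions, and a real GPTQ kernel keeps the packed integer weights and group-wise scales/zero points inside CUDA rather than materializing a full BF16 weight matrix like this.

```python
import torch

def fused_gptq_lora_matmul(x, qweight, scales, lora_a, lora_b, scaling):
    """Conceptual sketch only, not vLLM's kernel code.

    x:       [tokens, in_features]        bf16 activations
    qweight: [in_features, out_features]  int8 quantized base weight
    scales:  [out_features]               bf16 per-output-channel scales (simplified)
    lora_a:  [in_features, rank]          bf16 LoRA A
    lora_b:  [rank, out_features]         bf16 LoRA B
    """
    w = qweight.to(torch.bfloat16) * scales   # on-the-fly dequantization
    w = w + scaling * (lora_a @ lora_b)       # "fuse" the LoRA delta into the weight
    return x @ w                              # single bf16 matmul producing the output
```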

My understanding is that this is supported by vLLM’s integration of quantized models with LoRA, as described in LoRA With Quantization Inference and GPTQModel documentation. However, the exact kernel implementation details (such as whether the fusion is always in a single kernel or split across steps) are not exhaustively documented in the public docs, but the described flow matches the intended design for efficiency and accuracy.

Would you like a more detailed breakdown of the kernel operations or code-level specifics?

No, LoRA uses separate kernels. See: vllm/vllm/lora at main · vllm-project/vllm · GitHub

You are correct: in vLLM, LoRA computations are handled by separate kernels, not fused directly into the quantized (e.g., GPTQ) base model kernel. The base model’s quantized weights (INT8) are dequantized and used in their own kernel, while LoRA weights (typically BF16/FP16) are applied in a separate kernel pass. The outputs are then combined after both computations, rather than fusing LoRA into the quantized kernel itself. This is confirmed by the vllm/lora directory, which contains custom LoRA kernels and logic distinct from quantized model kernels.
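A minimal sketch of that separation, with dtypes noted in comments. Here `quantized_base_matmul` is a stand-in for the GPTQ kernel, and the LoRA shrink/expand is written as two plain matmuls rather than vLLM's actual Punica-style shrink/expand kernels.

```python
import torch

def lora_layer_forward(x, quantized_base_matmul, lora_a, lora_b, scaling):
    # Step 1: base pass. The GPTQ kernel fetches the INT8 weights, dequantizes
    # them on the fly inside CUDA, and returns a BF16/FP16 result; the INT8
    # data never leaves this kernel.
    y = quantized_base_matmul(x)        # y: [tokens, out_features], bf16

    # Step 2: LoRA pass, run as separate kernels entirely in BF16/FP16.
    shrink = x @ lora_a                 # [tokens, rank], bf16
    y = y + scaling * (shrink @ lora_b) # expand and add into the base output
    return y
```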

Would you like more detail on the memory flow or kernel invocation order?

Hmmm, so in the LoRA kernel, how are the base model weights transferred? I mean, does it keep the 8-bit GPTQ data type and then dequantize to bf16 and apply LoRA inside the kernel?

It seems likely that in vLLM the LoRA kernel does not operate directly on the quantized (8-bit) GPTQ weights. Instead, the base model’s INT8 weights are dequantized (converted to FP16/BF16) inside the quantized kernel, and the result of that computation is then passed on to the LoRA kernel. The LoRA kernel receives only higher-precision (BF16/FP16) activations, not the raw INT8 weights, and applies the LoRA adaptation using its own BF16/FP16 weights. So the LoRA kernel does not handle INT8 weights or dequantization itself; it only works with higher-precision data after the base computation.
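To make the dtype boundary concrete, here is a small hypothetical trace (all shapes and names are illustrative): only the quantized base kernel ever touches the integer data, while everything reaching the LoRA path is BF16.

```python
import torch

# Hypothetical shapes, just to show which kernel touches which dtype.
x       = torch.randn(4, 4096, dtype=torch.bfloat16)                # layer input (activations)
qweight = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)  # GPTQ base weight (packing simplified)
lora_a  = torch.randn(4096, 16, dtype=torch.bfloat16)               # LoRA A
lora_b  = torch.randn(16, 4096, dtype=torch.bfloat16)               # LoRA B

# The quantized base kernel is the only consumer of qweight (int8).
# The LoRA kernels see only bf16 tensors: the input x, lora_a/lora_b,
# and the base layer's bf16 output that they accumulate into.
assert qweight.dtype == torch.int8
assert all(t.dtype == torch.bfloat16 for t in (x, lora_a, lora_b))
```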

This separation is implied by the structure of vLLM’s LoRA implementation and is discussed in vllm/lora and related PRs, where LoRA logic is applied after the main model’s forward pass, not during quantized weight dequantization.

Would you like a step-by-step breakdown of the data flow between kernels?

@jeejeelee could you please explain to me the data type of the base model in the LoRA kernel?

The base model data type in the LoRA kernel is FP16 or BF16, not INT8. The LoRA kernel receives the activations or outputs from the base model after the quantized INT8 weights have already been dequantized to FP16/BF16 in the main model kernel. The LoRA kernel then applies its adaptation using these higher-precision activations and its own FP16/BF16 weights, never directly handling INT8 data.

Would you like a more detailed step-by-step explanation?

The base model and LoRA computations are separate, so the LoRA kernels don’t involve any dequantization of base model weights. See: base_model forward and lora forward

Correct. In vLLM, the computation for the base model and LoRA are separate: the base model’s quantized weights are dequantized and used in the base model’s forward pass, producing activations in FP16/BF16. The LoRA kernel then operates only on these activations and its own FP16/BF16 weights—there is no dequantization of base model weights inside the LoRA kernel. The LoRA kernel never sees INT8 data; it only works with higher-precision tensors output by the base model’s forward.
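A sketch of that forward pass as a wrapper layer, assuming a hypothetical quantized `base_layer` module whose own forward runs the GPTQ kernel. This only mirrors the structure (base forward first, LoRA added afterwards as a separate computation), not vLLM's actual LoRA layer classes.

```python
import torch
import torch.nn as nn

class LinearWithSeparateLoRA(nn.Module):
    """Illustrative wrapper: the quantized base matmul and the LoRA path are
    two separate computations whose BF16 outputs are summed."""

    def __init__(self, base_layer: nn.Module, in_features: int,
                 out_features: int, rank: int, scaling: float = 1.0):
        super().__init__()
        self.base_layer = base_layer   # dequantizes INT8 internally, returns bf16
        self.lora_a = nn.Parameter(torch.zeros(in_features, rank, dtype=torch.bfloat16))
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features, dtype=torch.bfloat16))
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base_layer(x)                                    # base forward (GPTQ kernel inside)
        y = y + self.scaling * ((x @ self.lora_a) @ self.lora_b)  # LoRA forward, pure bf16
        return y
```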

Would you like a step-by-step breakdown of the forward pass?
