How is LoRA (bf16) fused into a GPTQ 8-bit model?

Hi guys, I want to ask how LoRA (bf16) is applied to the base LLM model. When using a GPTQ 8-bit quantized model with LoRA adapters enabled, could you please clarify the exact memory flow and data types involved in the fused computation during inference?
As I understand:

  • GPTQ stores the weights in an 8-bit format in global memory (VRAM). During inference, are these INT8 weights dequantized on the fly inside the CUDA kernel (i.e., the kernel fetches the 8-bit weights)?
  • With LoRA, are the LoRA weights (bf16) fetched into the kernel and then fused with the dequantized layer inside the same CUDA kernel?

I really want to know exactly how the data is fetched and processed.

It seems likely that, when using a GPTQ 8-bit quantized model with LoRA adapters in vLLM, the base model weights are stored as INT8 (8 bits) in GPU memory. During inference, these INT8 weights are dequantized on the fly inside the CUDA kernel: the kernel fetches the 8-bit weights, applies the quantization scales, and converts them to FP16/BF16 for computation. For LoRA, the adapter weights are typically stored in BF16 or FP16 and are also fetched into the kernel. The fused computation happens inside the CUDA kernel: the dequantized base weights and the LoRA weights are combined (fused) during the matrix multiplication, so the output reflects both the quantized base and the LoRA adaptation in a single pass. This approach minimizes memory movement and maximizes efficiency by keeping all operations on the GPU and fusing them at the kernel level.

The data flow is: fetch INT8 base weights + BF16 LoRA weights → dequantize base weights → fuse with LoRA weights → compute output in FP16/BF16.
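For illustration, here is a rough PyTorch-style sketch of the fused flow described above. It is purely conceptual: names like `qweight`, `scales`, `lora_a`, and `lora_b` are assumptions, and a real GPTQ kernel keeps the packed integer weights and group-wise scales/zero points inside CUDA rather than materializing a full BF16 weight matrix like this.

```python
import torch

def fused_gptq_lora_matmul(x, qweight, scales, lora_a, lora_b, scaling):
    """Conceptual sketch only, not vLLM's kernel code.

    x:       [tokens, in_features]        bf16 activations
    qweight: [in_features, out_features]  int8 quantized base weight
    scales:  [out_features]               bf16 per-output-channel scales (simplified)
    lora_a:  [in_features, rank]          bf16 LoRA A
    lora_b:  [rank, out_features]         bf16 LoRA B
    """
    w = qweight.to(torch.bfloat16) * scales   # on-the-fly dequantization
    w = w + scaling * (lora_a @ lora_b)       # "fuse" the LoRA delta into the weight
    return x @ w                              # single bf16 matmul producing the output
```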

My understanding is that this is supported by vLLM’s integration of quantized models with LoRA, as described in LoRA With Quantization Inference and GPTQModel documentation. However, the exact kernel implementation details (such as whether the fusion is always in a single kernel or split across steps) are not exhaustively documented in the public docs, but the described flow matches the intended design for efficiency and accuracy.

Would you like a more detailed breakdown of the kernel operations or code-level specifics?

No, LoRA uses separate kernels. See: vllm/vllm/lora at main · vllm-project/vllm · GitHub

You are correct: in vLLM, LoRA computations are handled by separate kernels, not fused directly into the quantized (e.g., GPTQ) base model kernel. The base model’s quantized weights (INT8) are dequantized and used in their own kernel, while LoRA weights (typically BF16/FP16) are applied in a separate kernel pass. The outputs are then combined after both computations, rather than fusing LoRA into the quantized kernel itself. This is confirmed by the vllm/lora directory, which contains custom LoRA kernels and logic distinct from quantized model kernels.
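A minimal sketch of that separation, with dtypes noted in comments. Here `quantized_base_matmul` is a stand-in for the GPTQ kernel, and the LoRA shrink/expand is written as two plain matmuls rather than vLLM's actual Punica-style shrink/expand kernels.

```python
import torch

def lora_layer_forward(x, quantized_base_matmul, lora_a, lora_b, scaling):
    # Step 1: base pass. The GPTQ kernel fetches the INT8 weights, dequantizes
    # them on the fly inside CUDA, and returns a BF16/FP16 result; the INT8
    # data never leaves this kernel.
    y = quantized_base_matmul(x)        # y: [tokens, out_features], bf16

    # Step 2: LoRA pass, run as separate kernels entirely in BF16/FP16.
    shrink = x @ lora_a                 # [tokens, rank], bf16
    y = y + scaling * (shrink @ lora_b) # expand and add into the base output
    return y
```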

Would you like more detail on the memory flow or kernel invocation order?

Hmmm, so in the LoRA kernel, how are the base model weights transferred? I mean, does it keep the 8-bit GPTQ data type and then dequantize to bf16 and apply LoRA inside the kernel?

It seems likely that in vLLM the LoRA kernel does not operate directly on the quantized (8-bit) GPTQ weights. Instead, the base model’s INT8 weights are dequantized (converted to FP16/BF16) inside the quantized kernel, and the result of that computation is then passed on to the LoRA kernel. The LoRA kernel receives only higher-precision (BF16/FP16) activations, not the raw INT8 weights, and applies the LoRA adaptation using its own BF16/FP16 weights. So the LoRA kernel does not handle INT8 weights or dequantization itself; it only works with higher-precision data after the base computation.
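To make the dtype boundary concrete, here is a small hypothetical trace (all shapes and names are illustrative): only the quantized base kernel ever touches the integer data, while everything reaching the LoRA path is BF16.

```python
import torch

# Hypothetical shapes, just to show which kernel touches which dtype.
x       = torch.randn(4, 4096, dtype=torch.bfloat16)                # layer input (activations)
qweight = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)  # GPTQ base weight (packing simplified)
lora_a  = torch.randn(4096, 16, dtype=torch.bfloat16)               # LoRA A
lora_b  = torch.randn(16, 4096, dtype=torch.bfloat16)               # LoRA B

# The quantized base kernel is the only consumer of qweight (int8).
# The LoRA kernels see only bf16 tensors: the input x, lora_a/lora_b,
# and the base layer's bf16 output that they accumulate into.
assert qweight.dtype == torch.int8
assert all(t.dtype == torch.bfloat16 for t in (x, lora_a, lora_b))
```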

This separation is implied by the structure of vLLM’s LoRA implementation and is discussed in vllm/lora and related PRs, where LoRA logic is applied after the main model’s forward pass, not during quantized weight dequantization.

Would you like a step-by-step breakdown of the data flow between kernels?

@jeejeelee could you please explain to me the data type of the base model in the LoRA kernel?

The base model data type in the LoRA kernel is FP16 or BF16, not INT8. The LoRA kernel receives the activations or outputs from the base model after the quantized INT8 weights have already been dequantized to FP16/BF16 in the main model kernel. The LoRA kernel then applies its adaptation using these higher-precision activations and its own FP16/BF16 weights, never directly handling INT8 data.

Would you like a more detailed step-by-step explanation?

The base model and LoRA computations are separate, so the LoRA kernels don’t involve any dequantization of base model weights. See: base_model forward and lora forward

Correct. In vLLM, the computation for the base model and LoRA are separate: the base model’s quantized weights are dequantized and used in the base model’s forward pass, producing activations in FP16/BF16. The LoRA kernel then operates only on these activations and its own FP16/BF16 weights—there is no dequantization of base model weights inside the LoRA kernel. The LoRA kernel never sees INT8 data; it only works with higher-precision tensors output by the base model’s forward.
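A sketch of that forward pass as a wrapper layer, assuming a hypothetical quantized `base_layer` module whose own forward runs the GPTQ kernel. This only mirrors the structure (base forward first, LoRA added afterwards as a separate computation), not vLLM's actual LoRA layer classes.

```python
import torch
import torch.nn as nn

class LinearWithSeparateLoRA(nn.Module):
    """Illustrative wrapper: the quantized base matmul and the LoRA path are
    two separate computations whose BF16 outputs are summed."""

    def __init__(self, base_layer: nn.Module, in_features: int,
                 out_features: int, rank: int, scaling: float = 1.0):
        super().__init__()
        self.base_layer = base_layer   # dequantizes INT8 internally, returns bf16
        self.lora_a = nn.Parameter(torch.zeros(in_features, rank, dtype=torch.bfloat16))
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features, dtype=torch.bfloat16))
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base_layer(x)                                    # base forward (GPTQ kernel inside)
        y = y + self.scaling * ((x @ self.lora_a) @ self.lora_b)  # LoRA forward, pure bf16
        return y
```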

Would you like a step-by-step breakdown of the forward pass?
