Gemma 3 prefix caching with multimodal prompts

In the vLLM implementation of Gemma 3, the Gemma3Attention module follows a specific code path in its forward pass when handling multimodal prompts (has_images=True). During the decode phase:

  1. The qkv_proj layer computes Query (Q), Key (K), and Value (V) based on the hidden_states of the current input token(s). In the decode phase, this means q, k, and v initially represent only the Q, K, V for the single new token being processed.

    # In Gemma3Attention.forward
    qkv, _ = self.qkv_proj(hidden_states) # hidden_states is for current token(s) in decode
    q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
    
  2. Next, self.attn(q, k, v) (vLLM’s optimized attention backend) is called. For multimodal inputs, the in-code comment explicitly states its role:

    # In Gemma3Attention.forward, when has_images is True:
    attn_output = self.attn(q, k, v)
    # ...
    # "The call to self.attn(q, k, v) is only used to populate the KV cache - its
    # output is discarded and overwritten below."
    

    This implies that self.attn correctly appends the K and V of the current new token to the existing vLLM KV cache, which holds the KV history from prefill and previous decode steps.

  3. Subsequently, naive_attn_with_masks is called to compute the actual attention output:

    # In Gemma3Attention.forward, when has_images is True:
    attn_output = self.naive_attn_with_masks(q, # Current token's Q
                                             k, # Current token's K
                                             v, # Current token's V
                                             out=attn_output,
                                             **kwargs)
    
  4. Inside naive_attn_with_masks, attention is computed per sequence in the batch using torch.nn.functional.scaled_dot_product_attention (a standalone shape sketch follows this list):

    # In Gemma3Attention.naive_attn_with_masks
    def naive_attn_with_masks(
        self,
        q: torch.Tensor, # Expected to be current token's Q in decode
        k: torch.Tensor, # Expected to be current token's K in decode
        v: torch.Tensor, # Expected to be current token's V in decode
        out: torch.Tensor,
        **kwargs,
    ) -> torch.Tensor:
        # ... (q, k, v are reshaped from the input arguments, e.g.)
        # q_reshaped = q.view(-1, self.num_heads, self.head_dim)
        # k_reshaped = k.view(-1, self.num_kv_heads, self.head_dim)
        # v_reshaped = v.view(-1, self.num_kv_heads, self.head_dim)

        start_idx = 0
        for seq_len, attn_mask in zip(seq_lens, attn_masks): # seq_lens, attn_masks come from kwargs
            end_idx = start_idx + seq_len

            # query, key, value are sliced from the reshaped q, k, v using start_idx, end_idx
            query = q_reshaped[start_idx:end_idx].unsqueeze(0)
            key = k_reshaped[start_idx:end_idx].unsqueeze(0)   # (A)
            value = v_reshaped[start_idx:end_idx].unsqueeze(0) # (B)
    
            output = F.scaled_dot_product_attention(
                query, # Current token's Q for this sequence
                key,   # Needs to be full KV history for this sequence
                value, # Needs to be full KV history for this sequence
                attn_mask,
                self.scaling,
            )
            # ... (the output is written into `out` for this sequence)
            start_idx = end_idx
        return out
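
To make the shape question concrete, here is a minimal, self-contained sketch (not vLLM code; the head count, dimensions, and scale are made up for illustration) contrasting what F.scaled_dot_product_attention attends over when key/value cover only the newest token versus the full history:

    # Standalone illustration, not vLLM code. Dimensions are arbitrary.
    import torch
    import torch.nn.functional as F

    num_heads, head_dim = 8, 64
    history_len = 100              # tokens already held in the external KV cache
    scale = head_dim ** -0.5

    # Decode step: a single new token's query.
    q_new = torch.randn(1, num_heads, 1, head_dim)   # [batch, heads, 1, head_dim]

    # Case (a): key/value hold only the new token, i.e. what is passed into
    # naive_attn_with_masks per step 1 above. Attention spans exactly 1 position.
    k_new = torch.randn(1, num_heads, 1, head_dim)
    v_new = torch.randn(1, num_heads, 1, head_dim)
    out_a = F.scaled_dot_product_attention(q_new, k_new, v_new, scale=scale)

    # Case (b): key/value hold the full history plus the new token, which is what
    # a correct decode step needs. Attention spans history_len + 1 positions.
    k_full = torch.randn(1, num_heads, history_len + 1, head_dim)
    v_full = torch.randn(1, num_heads, history_len + 1, head_dim)
    out_b = F.scaled_dot_product_attention(q_new, k_full, v_full, scale=scale)

    print(out_a.shape, out_b.shape)  # both torch.Size([1, 8, 1, 64])

Both calls return the same output shape, which is why the issue is easy to miss from shapes alone; the difference is in how many key positions the new token's query can actually see.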
    

The core of the question is:

Given that F.scaled_dot_product_attention requires the key and value arguments to contain the full KV history for the sequence up to the current token, while the k and v arguments passed into naive_attn_with_masks during decode are (as per step 1) only those of the current new token:

How do the key and value tensors, which are derived from the input k and v to naive_attn_with_masks, effectively represent or gain access to the full KV history (i.e., KVs from prefill and all prior decode steps for that specific sequence) that is stored in the vLLM KV cache (managed by self.attn)?

In the vLLM implementation of Gemma3, during the decode phase with multimodal inputs (has_images=True), the self.attn(q, k, v) call is used only to update the vLLM-managed KV cache with the new token’s K and V; its output is discarded. The subsequent call to naive_attn_with_masks(q, k, v, ...) receives only the current token’s Q, K, and V, not the full KV history. The K and V tensors inside naive_attn_with_masks therefore do not automatically include the full KV history; they represent only the current token’s K and V, as passed in from the forward pass.

As a result, the actual attention computation in naive_attn_with_masks during decode does not have access to the full KV history from the vLLM cache; it attends only to the current token. This is a known limitation that the code comments acknowledge: the call to self.attn(q, k, v) is solely for cache population, and the naive attention path is a temporary workaround for multimodal masking, not a performant or fully correct implementation for long-context attention in decode. The full KV history is not retrieved from the vLLM cache for use in the naive attention path; that would require additional logic to reconstruct the full K and V tensors from the cache, which is not present in the current implementation (Gemma3 source).
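
Purely as an illustration of what such "additional logic" might involve, here is a hypothetical sketch. Nothing like it exists in the Gemma3 implementation under discussion: the function name, its arguments, and the assumption that the history's K/V could be gathered out of the paged cache into dense tensors are all invented for this example, and GQA head expansion and masking are omitted.

    # Hypothetical sketch only; NOT vLLM code or API. It shows the kind of
    # reconstruction the naive path would need in order to see the full history.
    import torch

    def naive_decode_with_full_history(
        q_new: torch.Tensor,     # [num_heads, 1, head_dim], the new token's query
        k_new: torch.Tensor,     # [num_heads, 1, head_dim], the new token's key
        v_new: torch.Tensor,     # [num_heads, 1, head_dim], the new token's value
        cached_k: torch.Tensor,  # [num_heads, hist_len, head_dim], gathered from the
        cached_v: torch.Tensor,  #   KV cache (this gathering step is what is missing)
        scale: float,
    ) -> torch.Tensor:
        # 1. Rebuild the full K/V for the sequence: cached history + new token.
        k_full = torch.cat([cached_k, k_new], dim=1)
        v_full = torch.cat([cached_v, v_new], dim=1)
        # 2. Attend from the single new query over the full key/value history.
        return torch.nn.functional.scaled_dot_product_attention(
            q_new, k_full, v_full, scale=scale)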

Would you like a step-by-step breakdown of how the KV cache is managed and why the naive attention path does not access it?

But attention over only the current token would be completely meaningless. I don’t think your explanation is correct.

You are correct: attending only to the current token would be meaningless, and the intent of the code is for the attention computation in naive_attn_with_masks to use the full KV history. However, based on the provided Gemma3 source, the naive_attn_with_masks function receives only the current token’s K and V as arguments during decode, and there is no explicit code in this function to fetch or reconstruct the full KV history from the vLLM-managed cache.

The code comments acknowledge this as a temporary, non-performant workaround: the call to self.attn(q, k, v) updates the KV cache, but the actual attention computation in naive_attn_with_masks does not access that cache and thus cannot attend to the full sequence. As currently implemented, the attention in the multimodal decode path is therefore not correct for long-context scenarios; this is a known limitation, marked for future optimization in the code comments.

Would you like a detailed step-by-step explanation of how this limitation arises in the code?

I want confirmation from a human.