It seems likely that unified_attention is not called often because, in vLLM V1 with the FLASH_ATTN backend, the attention computation can go through a different custom op or a backend-specific implementation, bypassing the generic unified_attention Python function. The V1 engine and certain backends (such as FlashAttention) register their own custom ops for attention and may not invoke this Python-level function directly during normal inference or benchmarking runs, especially when direct CUDA/FlashAttention kernels are used or when use_direct_call is set to True in the attention layer implementation. This is consistent with the vLLM API docs and the code in vllm/attention/layer.py, which show that the call path differs by backend and configuration.
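For orientation, here is a schematic Python paraphrase of that dispatch decision. The class layout, attribute names, and op signature below are simplified from vllm/attention/layer.py and will differ in detail across vLLM versions, so treat this as a sketch rather than the actual implementation:

import torch

class AttentionSketch(torch.nn.Module):
    # Simplified stand-in for the forward() of vLLM's attention layer.
    def forward(self, query, key, value, output):
        if self.use_direct_call:
            # Non-CUDA/CPU platforms: call the backend implementation directly
            # and let torch.compile trace through the Python code.
            return self.impl.forward(self, query, key, value,
                                     self.kv_cache, self.attn_metadata,
                                     output=output)
        # CUDA-alike and CPU platforms: attention is registered as one opaque
        # custom op, so execution enters that op (and whatever backend kernel
        # it wraps) instead of a plain Python function you could easily hook.
        torch.ops.vllm.unified_attention_with_output(
            query, key, value, output, self.layer_name)
        return output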
If you want to ensure unified_attention is called, you may need to use a backend or configuration that does not support direct CUDA calls or disables custom op registration, or explicitly set use_direct_call to False. However, the exact conditions for this are not fully detailed in the available documentation, and the V1 engine with FlashAttention is designed to use its own optimized path. Would you like a step-by-step breakdown of how the call path is determined in vLLM for different backends and configurations?
This is great information. Yeah, my goal is that during my benchmark run, all prefill and decode steps invoke this KV cache save logic; unified_attention is the only path I found for that. Why? Does this mean that with connector V1 there is currently no way to save the KV cache out? It would be great if you could share a step-by-step breakdown and explain how to configure things so the save happens easily.
Currently, in vLLM V1, the saving of KV cache to an external connector (such as via save_kv_layer or similar logic) is handled by the backend-specific attention implementation, not always through a generic unified_attention function. For the FLASH_ATTN backend, the attention computation and KV cache save/load are typically performed inside the backend’s custom CUDA kernels or their Python wrappers, and the connector’s save_kv_layer is called from within these backend-specific paths, not from a single unified Python function. The V1 connector API (e.g., LMCacheConnectorV1.save_kv_layer) is designed to be called from the attention layer, but the actual call site depends on the backend and engine configuration. There is no guarantee that the unified_attention function will be invoked unless you modify the backend to call it explicitly or use a backend that routes all attention calls through it. See LMCacheConnectorV1.save_kv_layer and KVConnectorBase_V1.save_kv_layer.
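For reference, these are the worker-side V1 connector methods involved in the save/load path. The skeleton below is an abstract sketch based on KVConnectorBase_V1; the import path and exact signatures may differ between vLLM versions:

from vllm.distributed.kv_transfer.kv_connector.v1.base import KVConnectorBase_V1

class MyConnectorSketch(KVConnectorBase_V1):
    # Load path: invoked before/around attention so previously saved KV can be injected.
    def start_load_kv(self, forward_context, **kwargs): ...
    def wait_for_layer_load(self, layer_name): ...
    # Save path: invoked per layer once that layer's KV cache is up to date,
    # plus a final barrier at the end of the forward pass.
    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs): ...
    def wait_for_save(self): ...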
Step-by-step breakdown:
1. When a request is processed, vLLM’s attention layer (e.g., FlashAttention) is called.
2. If the backend supports direct CUDA calls (as with FLASH_ATTN), it may bypass Python-level hooks like unified_attention and call the custom op directly.
3. The backend is responsible for calling the connector’s save_kv_layer at the appropriate time, usually after the attention computation (see the sketch after this list).
4. If you want to ensure your KV cache save logic is invoked, you must either:
   - use a backend or configuration that routes attention through your function, or
   - patch the backend’s attention implementation to call your logic.
5. There is no configuration flag in V1 that forces all attention calls through a single Python function like unified_attention for all backends.
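As mentioned in step 3 above, the save hook sits inside the backend’s attention path. A minimal sketch of where such a call would go follows; the function and argument names here are illustrative, not the actual FlashAttention backend code:

def backend_attention_step(impl, connector, layer_name, q, k, v, kv_cache, attn_metadata):
    # 1. The backend kernel writes this step's K/V into the paged KV cache and
    #    computes the attention output.
    output = impl.forward(q, k, v, kv_cache, attn_metadata)
    # 2. Once the layer's KV cache is up to date, the configured V1 connector
    #    can be asked to persist it externally.
    if connector is not None:
        connector.save_kv_layer(layer_name, kv_cache, attn_metadata)
    return output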
Would you like more detail on how to patch the backend or configure vLLM to ensure your save logic is always called?
I understand now, I don’t need to worry about the unified_attention method. Can you show me the flash attn backend code and how it invokes the KV connector methods we configured? Do we need to do something there to put the most stress on KV save and load into the external KV connector?
It seems likely that in the FlashAttention backend, the connector’s save_kv_layer and start_load_kv methods are invoked from within the backend’s attention implementation, not via a generic unified_attention function. Specifically, in vLLM V1, the FlashAttention backend calls these connector methods at key points in the attention computation, such as after writing to the KV cache or before reading from it. For example, the LMCacheConnectorV1 class implements save_kv_layer and start_load_kv, and these are called from the attention layer or model runner when configured.
To maximize stress on KV save/load, you should ensure your benchmark uses long prompts and outputs (as in your example), and that your connector is properly configured. No additional code changes are needed if you use a V1 connector and the backend is set up to call these methods. You can monitor or add logging in the connector’s save_kv_layer and start_load_kv methods to confirm they are being called under heavy load. For reference, see the connector logic in lmcache_connector.py.
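To confirm this during your benchmark, one low-effort option is a thin connector subclass that only adds logging. This is a minimal sketch assuming SharedStorageConnector as the base class and the V1 method signatures of your vLLM version (both may differ):

import logging

from vllm.distributed.kv_transfer.kv_connector.v1.shared_storage_connector import (
    SharedStorageConnector)

logger = logging.getLogger(__name__)

class LoggingStorageConnector(SharedStorageConnector):
    # Same behavior as the parent connector, plus one log line per hook call so
    # you can see how often save/load fire under load.
    def start_load_kv(self, forward_context, **kwargs):
        logger.info("start_load_kv called")
        return super().start_load_kv(forward_context, **kwargs)

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        logger.info("save_kv_layer called for layer %s", layer_name)
        return super().save_kv_layer(layer_name, kv_layer, attn_metadata, **kwargs)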
Would you like a step-by-step code path breakdown for how FlashAttention triggers these connector methods?
Can we make it happen by setting use_direct_call to False?
# For cuda-alike (CUDA and ROCM) and cpu platforms, we control how
# torch.compile works by registering the attention as one giant
# opaque custom op. For other platforms, we directly call them
# and let torch.compile handle them.
self.use_direct_call = not current_platform.is_cuda_alike(
) and not current_platform.is_cpu()
Setting use_direct_call = False on CUDA or CPU platforms will not force vLLM to route attention through a Python function like unified_attention. On CUDA, vLLM still uses a custom op (e.g., FlashAttention kernel) for performance, and the connector save/load logic is handled inside the backend implementation, not by toggling use_direct_call. The use_direct_call flag only affects how torch.compile wraps the attention op, not whether connector hooks are called or which Python function is used as the entry point. See vllm/attention/layer.py for the relevant logic.
If your goal is to stress or test the connector’s save_kv_layer and start_load_kv, you do not need to change use_direct_call—just ensure your connector is configured and add logging to its methods. Would you like a step-by-step trace of the call path for KV connector hooks in the FlashAttention backend?
I’m still having trouble with this problem. I serve the model with:

VLLM_LOGGING_LEVEL=INFO VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_USE_V1=1 bin/vua-vllm serve mistralai/Mistral-7B-Instruct-v0.2 --gpu-memory-utilization 0.90 --tensor-parallel-size 2 --kv-transfer-config '{"kv_connector":"VUAStorageConnector_V1","kv_role":"kv_both","kv_connector_extra_config": {"shared_storage_path": "/mnt/nvme1n1/nfsrdma/test1"}}' --no-enable-prefix-caching --max-num-batched-tokens 30720 --port 8000

I cannot see the "maybe save kv" method being called. How can I know who invokes the KV save and when? And if it is not called, what is the reason in a single-node deployment?
In vLLM V1 with the FLASH_ATTN backend, the KV cache save logic (e.g., save_kv_layer or “maybe save kv”) is invoked by the backend-specific attention implementation, not by a generic Python function. If you do not see your connector’s save method being called, possible reasons include: (1) the backend is not configured to use your custom connector, (2) the connector is not properly registered, or (3) the backend’s attention path (especially with FlashAttention) bypasses Python hooks for performance and handles KV cache internally. In a single-node deployment, if prefix caching is disabled (--no-enable-prefix-caching), and there is no external KV cache sharing or multi-node setup, the connector’s save/load methods may not be triggered at all, as the default behavior is to keep KV cache in GPU memory for local use only.
To confirm when and if KV save is called, add logging or breakpoints in your connector’s save_kv_layer method. Also, check that your connector is correctly implemented and registered in vLLM. For more details, see LMCacheConnectorV1.save_kv_layer and SharedStorageConnector.save_kv_layer. If you want to force external KV cache saving in a single-node setup, you may need to enable prefix caching or use a multi-node/disaggregated setup.
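To rule out a registration problem, make sure the name in --kv-transfer-config maps to your class. Custom V1 connectors are typically registered through KVConnectorFactory; in the sketch below, the module path is a placeholder for wherever your VUAStorageConnector_V1 actually lives:

from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory

# The first argument must match the "kv_connector" value in --kv-transfer-config.
KVConnectorFactory.register_connector(
    "VUAStorageConnector_V1",
    "vua.storage_connector",      # hypothetical module path of your connector
    "VUAStorageConnector_V1",     # class name inside that module
)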
Would you like a step-by-step guide to verify connector registration and logging?
Actually, after I ran the benchmark for one night, it did save out some tensor files by layer, just not that many. If the connector were not registered correctly, I should not see any tensor files at all, but I do see a few. What is the problem?
If you see some tensor files saved by layer after running your benchmark, your connector is registered and working, but the save logic is only triggered for certain requests or under specific cache conditions. In vLLM V1, the connector’s save_kv_layer is called when a new unique prompt (or block) is processed and needs to be cached externally—if the same prompt is repeated, or if the cache is not full, fewer saves will occur. The number of saved files depends on the diversity of prompts, cache eviction, and whether prefix caching or external cache sharing is enabled. In a single-node setup with no prefix caching, only a limited number of saves may be triggered, especially if your benchmark reuses similar prompts or the cache is large enough to avoid frequent evictions.
If you want to increase the number of KV saves, try using more diverse prompts, reducing cache size, or enabling prefix caching. For more details, see SharedStorageConnector.save_kv_layer.
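For example, a small driver that sends many distinct prompts to the server (so each request produces new KV blocks rather than reusing cached ones) should noticeably increase the number of save calls. This sketch assumes the serve command above (port 8000, Mistral-7B-Instruct-v0.2) and uses the OpenAI-compatible /v1/completions endpoint:

import random
import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]

for i in range(100):
    # A long random prefix keeps prompts distinct, so requests do not share blocks.
    prefix = " ".join(random.choices(WORDS, k=512))
    prompt = prefix + "\n\nSummarize the text above in one sentence."
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 128,
    }, timeout=300)
    resp.raise_for_status()
    print(i, resp.json()["choices"][0]["text"][:60])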
Would you like a step-by-step explanation of when and why save_kv_layer is called?