Best practice for synchronizing KVCache offloading across ranks (asking in context of tp)?

palakg11 · April 13, 2026, 4:52pm

Hi everyone,

I’m developing a vLLM plugin designed to offload KVCache to external storage. Since each rank handles its cache independently, I need a reliable way to confirm that all ranks (all tps) have finished their respective save operations before proceeding.

I’m considering using a distributed barrier, but I have a few questions:

Is there a recommended way to access the existing process group/distributed framework within a vLLM plugin to call a barrier?
Does vLLM provide a built-in utility for rank synchronization that is preferred over torch.distributed.barrier()?

RunLLM · April 13, 2026, 4:53pm

Yes, vLLM provides its own distributed synchronization utilities. The recommended way is to use the process group abstraction available via get_world_group() from vllm.distributed.parallel_state, which exposes both a device group (NCCL) and a CPU group (GLOO). For control-plane synchronization like barriers, you should use the CPU group (GLOO backend), as using NCCL for barriers is discouraged due to stability issues. You can call get_world_group().cpu_group and then use torch.distributed.barrier(group=cpu_group) for synchronization, or use vLLM’s higher-level barrier utility if available in your context.

Alternatively, vLLM provides a StatelessProcessGroup.barrier() method, which implements a robust, multi-phase barrier for synchronizing all ranks and is designed for metadata/control-plane use. This is preferred over directly calling torch.distributed.barrier() with NCCL, especially in plugin or extension code, to avoid device context issues and deadlocks. See vllm.distributed.utils.StatelessProcessGroup.barrier() and torchrun example for usage.

Would you like a code example or more detail on integrating this into your plugin?

Sources:

Hint: Mention @RunLLM in the post for followups.

palakg11 · April 13, 2026, 5:03pm

@RunLLM I am concerned that in a high throughut scenario barrier is going to tank my performance. Is there any other way?

RunLLM · April 13, 2026, 5:03pm

You are correct: using a barrier will force all ranks to wait for the slowest, which can reduce throughput, especially in high-throughput or asynchronous scenarios. vLLM does not provide a more granular or non-blocking synchronization primitive out of the box; the barrier is an all-or-nothing synchronization point designed for correctness, not performance. If you need finer control, you would need to implement custom coordination logic (e.g., using distributed key-value stores or asynchronous notifications), but this is not directly supported by vLLM’s public APIs.

Would you like more detail on possible custom approaches or examples of how to implement non-blocking coordination?

Sources:

vllm/distributed/utils.py

Hint: Mention @RunLLM in the post for followups.

palakg11 · April 13, 2026, 5:05pm

@runLLM Sure, can you give one example?

RunLLM · April 13, 2026, 5:05pm

Certainly! One common custom approach is to use a distributed key-value store (like PyTorch’s torch.distributed.TCPStore) to implement a lightweight “all-done” check: each rank writes a key when it finishes, and then periodically checks if all expected keys are present before proceeding. This avoids blocking all ranks at once and allows for asynchronous progress.

Would you like a step-by-step code example for this pattern?

Sources:

vllm/distributed/utils.py

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
vLLM on 4 nodes fails randomly General	1	119	April 6, 2026
Intergate async-llm to openrlfh, when tensor_parallel_size > 1 will cause ray-error OpenRLHF	1	237	April 27, 2025
Vllm中，deepseek的模型刷新kvcache的地方在哪 General	44	948	March 30, 2026
Data parallel and kv cache with multiple GPUs General	2	409	September 23, 2025
Pp8并行，update_from_output 会等所有rank的 model_executor.execute_model 执行完了之后才会执行吗 General	84	2107	January 8, 2026

Best practice for synchronizing KVCache offloading across ranks (asking in context of tp)?

Related topics