I’m developing a vLLM plugin designed to offload KVCache to external storage. Since each rank handles its cache independently, I need a reliable way to confirm that all ranks (all tps) have finished their respective save operations before proceeding.
I’m considering using a distributed barrier, but I have a few questions:
Is there a recommended way to access the existing process group/distributed framework within a vLLM plugin to call a barrier?
Does vLLM provide a built-in utility for rank synchronization that is preferred over torch.distributed.barrier()?
Yes, vLLM provides its own distributed synchronization utilities. The recommended way is to use the process group abstraction available via get_world_group() from vllm.distributed.parallel_state, which exposes both a device group (NCCL) and a CPU group (GLOO). For control-plane synchronization like barriers, you should use the CPU group (GLOO backend), as using NCCL for barriers is discouraged due to stability issues. You can call get_world_group().cpu_group and then use torch.distributed.barrier(group=cpu_group) for synchronization, or use vLLM’s higher-level barrier utility if available in your context.
Alternatively, vLLM provides a StatelessProcessGroup.barrier() method, which implements a robust, multi-phase barrier for synchronizing all ranks and is designed for metadata/control-plane use. This is preferred over directly calling torch.distributed.barrier() with NCCL, especially in plugin or extension code, to avoid device context issues and deadlocks. See vllm.distributed.utils.StatelessProcessGroup.barrier() and torchrun example for usage.
Would you like a code example or more detail on integrating this into your plugin?
You are correct: using a barrier will force all ranks to wait for the slowest, which can reduce throughput, especially in high-throughput or asynchronous scenarios. vLLM does not provide a more granular or non-blocking synchronization primitive out of the box; the barrier is an all-or-nothing synchronization point designed for correctness, not performance. If you need finer control, you would need to implement custom coordination logic (e.g., using distributed key-value stores or asynchronous notifications), but this is not directly supported by vLLM’s public APIs.
Would you like more detail on possible custom approaches or examples of how to implement non-blocking coordination?
Certainly! One common custom approach is to use a distributed key-value store (like PyTorch’s torch.distributed.TCPStore) to implement a lightweight “all-done” check: each rank writes a key when it finishes, and then periodically checks if all expected keys are present before proceeding. This avoids blocking all ranks at once and allows for asynchronous progress.
Would you like a step-by-step code example for this pattern?