How to understand the stage 2 comment in custom_all_reduce

Note: it’s important to match the tid between the two stages, because visibility across devices is only guaranteed between threads that have the same tid. If thread i computes the sum of start + i in the first stage, then thread i also gathers start + i from all ranks.

Why is visibility across devices only guaranteed between threads that have the same tid?

Visibility across devices is only guaranteed between threads with the same thread ID (tid) because the custom allreduce implementation in vLLM synchronizes data exchange at the granularity of individual thread indices. Each thread owns a specific data segment, and the synchronization primitives (barriers and memory fences) only order the writes made by thread tid on one GPU against the reads made by thread tid on another. This per-thread handshake avoids race conditions without requiring a heavyweight global ordering, as described in custom_all_reduce.cuh and discussed in PR #8558.

If a thread with one tid read data written by a thread with a different tid, nothing would order those two operations: there is no global coherence or synchronization between arbitrary threads on different GPUs, so the reader could observe stale or partially written data. Matching tids therefore pairs each stage 1 write with exactly one ordered stage 2 read, making every element's handoff well-defined and safe across devices, as explained in the code comments and PR discussions.
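
To make the tid-matching concrete, here is a simplified sketch (not vLLM's actual kernel; buffer names, parameters, and the omitted barrier are illustrative assumptions) of a two-stage allreduce in which thread tid owns element start + tid in both stages:

```cuda
// Hypothetical sketch of the two-stage pattern. In stage 1, thread tid on
// rank `rank` reduces element (start + tid) across all ranks; in stage 2,
// the SAME tid gathers that element's partial sum from every rank.
__global__ void two_stage_allreduce(float** rank_inputs,    // peer-visible input buffers
                                    float** rank_partials,  // peer-visible partial-sum buffers
                                    float* output,
                                    int nranks, int rank, int shard_size) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int start = rank * shard_size;  // this rank's shard of the tensor
  if (tid >= shard_size) return;

  // Stage 1: thread tid computes the cross-rank sum of element (start + tid)
  // and publishes it in this rank's partial buffer.
  float sum = 0.0f;
  for (int r = 0; r < nranks; ++r)
    sum += rank_inputs[r][start + tid];
  rank_partials[rank][start + tid] = sum;

  // Cross-device synchronization goes here (omitted): a flag-based handshake
  // with memory fences, performed per-tid, that orders thread tid's stage 1
  // write on each rank before thread tid's stage 2 reads on every other rank.

  // Stage 2: the same thread tid gathers element (r * shard_size + tid) from
  // each rank r. Because writer and reader share a tid, the per-tid handshake
  // above is exactly the ordering this read needs; a thread with a different
  // tid would have no such guarantee.
  for (int r = 0; r < nranks; ++r)
    output[r * shard_size + tid] = rank_partials[r][r * shard_size + tid];
}
```

The key invariant is that the index an element is written at in stage 1 and read at in stage 2 is a function of tid alone, so the per-tid synchronization covers every access.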
