Why is the NCCL communication type missing in the PyTorch profiler when I capture the decode step in Llama 3?

My goal is to measure all of the communication time in the decode step.

A. This type of ncclDevKernel_Generic event does not show the collective communication type.

{
  "ph": "X", "cat": "cuda_runtime", "name": "hipExtLaunchKernel", "pid": 535, "tid": 535,
  "ts": 5051560851217.906, "dur": 5.058,
  "args": {
    "External id": 162, "kernel": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)", "cid": 62, "correlation": 79, "grid": [1, 1, 1], "block": [256, 1, 1], "shared memory": 0
  }
}

B. This type of ncclDevKernel_Generic event does show the collective communication type.
{
  "ph": "X", "cat": "kernel", "name": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)", "pid": 2, "tid": 12,
  "ts": 5051589368012.054, "dur": 16.624,
  "args": {
    "External id": 20271, "device": 2, "stream": 12, "correlation": 37912, "kind": "Dispatch Kernel", "grid": [1, 1, 1], "block": [256, 1, 1], "Collective name": "broadcast", "In msg nelems": 4001, "Out msg nelems": 4001, "Group size": 8, "dtype": "Int", "In split size": "", "Out split size": "", "Process Group Name": "3", "Process Group Description": "undefined", "Process Group Ranks": "[0, 1, 2, 3, 4, 5, 6, 7]"
  }
}

It seems likely that vLLM does not provide a built-in method to directly map all NCCL kernel launches (like ncclDevKernel_Generic) to their specific collective communication types during the decode step, especially when the trace does not include the “Collective name” field. The NCCL wrapper in vLLM exposes functions for all_reduce, all_gather, broadcast, send, recv, and reduce_scatter, but the trace output you provided is lower-level and may not always include the collective type in its metadata, depending on how the kernel is launched or traced. This limitation is not explicitly addressed in the vLLM documentation or codebase, so you may need to rely on external profiling tools or custom instrumentation to correlate kernel launches with high-level collectives during decode. See pynccl_wrapper API for the available NCCL operations.
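One form of custom instrumentation is to wrap each high-level collective call in `torch.profiler.record_function`, so an annotation carrying the collective's name brackets the kernel launch in the trace. This is a minimal sketch: `annotate_collective` and the `comm.*` attribute names in the usage comment are illustrative, not vLLM APIs; `record_function` itself is a real PyTorch profiler API.

```python
import functools

def annotate_collective(name, fn, record=None):
    """Return fn wrapped so each call is bracketed by a named
    profiler annotation (visible in the exported trace)."""
    if record is None:
        # Default to PyTorch's user-annotation context manager.
        import torch
        record = torch.profiler.record_function

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with record(f"nccl::{name}"):
            return fn(*args, **kwargs)
    return wrapper

# Hypothetical usage against vLLM's NCCL wrapper object (names assumed):
# comm.all_reduce = annotate_collective("all_reduce", comm.all_reduce)
# comm.broadcast  = annotate_collective("broadcast",  comm.broadcast)
```

Because the annotation and the kernel launch happen on the same thread, the trace viewer can nest the `ncclDevKernel_Generic` launch under the named `nccl::*` span.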

Would you like more detail on how to instrument or profile these communications in vLLM, or on how vLLM’s decode step interacts with NCCL collectives?


I am testing on AMD GPUs and capturing the PyTorch profile there.

I have already added some record_function calls (in pytorch/log) and tried to see what happens in pynccl_wrapper; I assume that is the high-level path. But the trace still does not show the collective name.

vLLM’s pynccl_wrapper and PyTorch’s record_function only provide high-level hooks and do not expose the NCCL collective name (e.g., all_reduce, broadcast) in the kernel-level trace when using AMD GPUs (ROCm/RCCL). The collective name is not included in the trace output by default, and this limitation is not addressed in vLLM or PyTorch documentation. You may need to use ROCm profiling tools (like rocprof or rocTracer) and ensure your profiling configuration captures the “Collective name” field, if supported by the ROCm stack. This is not something vLLM or PyTorch can control directly at the Python level. See pynccl_wrapper API.
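When the collective name only sometimes appears, one pragmatic fallback is to post-process the exported Chrome trace and total NCCL kernel time per collective, bucketing unannotated launches under "unknown". A sketch follows; the function name is mine, but the event fields ("cat", "name", "dur", args["Collective name"]) match the trace snippets above.

```python
from collections import defaultdict

def summarize_nccl(events):
    """Sum 'dur' (microseconds) of NCCL device-kernel events,
    grouped by 'Collective name' when the trace provides it."""
    totals = defaultdict(float)
    for ev in events:
        if ev.get("cat") == "kernel" and "ncclDevKernel" in ev.get("name", ""):
            coll = ev.get("args", {}).get("Collective name", "unknown")
            totals[coll] += ev.get("dur", 0.0)
    return dict(totals)

# Usage (assumed trace layout of torch.profiler's Chrome export):
#   import json
#   events = json.load(open("trace.json"))["traceEvents"]
#   print(summarize_nccl(events))
```

The size of the "unknown" bucket then tells you directly how much decode-step communication time the profiler failed to attribute.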

Would you like more detail on ROCm profiling or on how vLLM calls RCCL/NCCL collectives?


But I already added record_function calls and print statements before the broadcast calls in the distributed folder, and tried tracing at the high level to see when broadcast happens, but it failed. Maybe broadcast is called from other places, so I did not add the instrumentation in the right spots.

You are correct: in vLLM, collective communication (like broadcast) can be called from multiple places, not just a single function. The main entry point for broadcasting tensor dictionaries is typically broadcast_tensor_dict in vllm.distributed.communication_op, but this function may be called indirectly by various worker or model runner classes, depending on the execution path and hardware backend. If you only instrumented one location, you might miss other calls, especially since speculative decoding, pipeline parallelism, or other features may trigger broadcasts elsewhere. This is not fully documented in vLLM, but the codebase and issues confirm that broadcast calls can be scattered and sometimes consolidated during refactors (see this discussion).
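Rather than instrumenting each call site by hand, you can patch a single chokepoint (for example `torch.distributed.broadcast`, or `broadcast_tensor_dict` in `vllm.distributed.communication_op`) so every caller is logged with its file and line, which reveals call sites you may have missed. A minimal sketch; `patch_with_caller_log` is my name, not a vLLM or PyTorch API.

```python
import traceback

def patch_with_caller_log(obj, attr, log=print):
    """Replace obj.attr with a wrapper that logs each caller's
    source location before delegating to the original function."""
    orig = getattr(obj, attr)

    def wrapper(*args, **kwargs):
        # extract_stack returns frames oldest-first; with limit=2 the
        # first entry is the caller, the second is this wrapper.
        frame = traceback.extract_stack(limit=2)[0]
        log(f"{attr} called from {frame.filename}:{frame.lineno}")
        return orig(*args, **kwargs)

    setattr(obj, attr, wrapper)
    return orig  # keep the original so it can be restored later

# Hypothetical usage, applied once at worker startup:
#   import torch.distributed as dist
#   patch_with_caller_log(dist, "broadcast")
```

Patching the chokepoint once catches broadcasts triggered by speculative decoding, pipeline parallelism, or any other path, without needing to know all the call sites in advance.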

Would you like a step-by-step guide on how to comprehensively instrument all broadcast calls in vLLM?


From this profiler capture, I can never tell which communication is in the file; it only shows ncclDevKernel_Generic.

It seems likely that, as your profiler shows, the kernel name ncclDevKernel_Generic appears for all NCCL/RCCL collectives, and only sometimes includes the “Collective name” (e.g., “broadcast”) in the arguments. This is a limitation of the kernel-level tracing: not every ncclDevKernel_Generic launch is annotated with the specific collective type, especially on AMD/ROCm or depending on the profiler/tool version. vLLM and PyTorch do not add extra metadata at the kernel level, so you cannot always map these kernels to high-level communication calls just from the profiler output. This limitation is not addressed in vLLM’s documentation or codebase, and is a known challenge in distributed profiling (see related vLLM issue).
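One correlation strategy worth trying: a runtime-launch event (cat "cuda_runtime") and the kernel it launched (cat "kernel") normally share a "correlation" id, so when a kernel event does carry "Collective name", you can propagate that name back to its unannotated launch event. This sketch assumes that shared-id convention holds in your trace (it matches how the ids appear in the snippets above, where each snippet shows one side of a launch/kernel pair).

```python
def label_launches(events):
    """Copy 'Collective name' from annotated kernel events onto the
    runtime-launch events that share the same 'correlation' id."""
    name_by_corr = {}
    for ev in events:
        args = ev.get("args", {})
        if ev.get("cat") == "kernel" and "Collective name" in args:
            name_by_corr[args.get("correlation")] = args["Collective name"]

    for ev in events:
        args = ev.get("args", {})
        corr = args.get("correlation")
        if ev.get("cat") == "cuda_runtime" and corr in name_by_corr:
            args["Collective name"] = name_by_corr[corr]
    return events
```

Launch events that still lack a name after this pass are the ones where the profiler genuinely recorded no collective type, i.e. the gap that only annotation at the Python level (or a ROCm-side tool) can close.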

Would you like more detail on alternative strategies to correlate high-level communication calls with kernel launches, or on how to improve profiling for collective operations?
