My goal is to calculate the total communication time in the decode step.
A. This type of ncclDevKernel_Generic event doesn't show the collective communication type.
{
  "ph": "X", "cat": "cuda_runtime", "name": "hipExtLaunchKernel", "pid": 535, "tid": 535,
  "ts": 5051560851217.906, "dur": 5.058,
  "args": {
    "External id": 162, "kernel": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)", "cid": 62, "correlation": 79, "grid": [1, 1, 1], "block": [256, 1, 1], "shared memory": 0
  }
}
B. This type of ncclDevKernel_Generic event shows the collective communication type.
{
  "ph": "X", "cat": "kernel", "name": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)", "pid": 2, "tid": 12,
  "ts": 5051589368012.054, "dur": 16.624,
  "args": {
    "External id": 20271, "device": 2, "stream": 12, "correlation": 37912, "kind": "Dispatch Kernel", "grid": [1, 1, 1], "block": [256, 1, 1], "Collective name": "broadcast", "In msg nelems": 4001, "Out msg nelems": 4001, "Group size": 8, "dtype": "Int", "In split size": "", "Out split size": "", "Process Group Name": "3", "Process Group Description": "undefined", "Process Group Ranks": "[0, 1, 2, 3, 4, 5, 6, 7]"
  }
},
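
For reference, here is a rough sketch of how the communication time could be summed from such a trace. It assumes the file is a Chrome-trace-style JSON (either a bare list of events or an object with a "traceEvents" list) and that "dur" is in microseconds, as in that format. It counts only events with "cat": "kernel" (type B above), since type A has "cat": "cuda_runtime" and so looks like a host-side launch call rather than the GPU kernel itself. The function names nccl_kernel_time_us and load_trace_events, and the optional ts_start / ts_end window for the decode step, are hypothetical and not part of any profiler API.

```python
import json
from collections import defaultdict


def load_trace_events(path):
    """Load events from a trace file (assumed Chrome-trace-style JSON)."""
    with open(path) as f:
        trace = json.load(f)
    # Assumption: either {"traceEvents": [...]} or a bare list of events.
    return trace["traceEvents"] if isinstance(trace, dict) else trace


def nccl_kernel_time_us(events, ts_start=None, ts_end=None):
    """Sum "dur" of GPU-side NCCL kernel events, grouped by collective name."""
    totals = defaultdict(float)
    for ev in events:
        # Keep only complete ("X") events in the "kernel" category; this
        # skips the cuda_runtime launch records (type A in the post).
        if ev.get("ph") != "X" or ev.get("cat") != "kernel":
            continue
        if not ev.get("name", "").startswith("ncclDevKernel"):
            continue
        # Optional time window, e.g. the start and end of one decode step.
        ts = ev.get("ts", 0.0)
        if ts_start is not None and ts < ts_start:
            continue
        if ts_end is not None and ts >= ts_end:
            continue
        collective = ev.get("args", {}).get("Collective name", "unknown")
        totals[collective] += ev.get("dur", 0.0)
    return dict(totals)


# The two events from the post, trimmed to the fields the function reads.
sample = [
    {"ph": "X", "cat": "cuda_runtime", "name": "hipExtLaunchKernel",
     "ts": 5051560851217.906, "dur": 5.058,
     "args": {"kernel": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)"}},
    {"ph": "X", "cat": "kernel",
     "name": "ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)",
     "ts": 5051589368012.054, "dur": 16.624,
     "args": {"device": 2, "stream": 12, "Collective name": "broadcast"}},
]

print(nccl_kernel_time_us(sample))
```

Note that this simply adds up kernel durations. If the trace contains kernels from several devices or overlapping streams, the sum will be larger than the wall-clock communication time, so filtering on args["device"] or merging overlapping intervals may be needed.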
