I am reviewing the logic behind rearrange_expert_weights_inplace and I have some confusion regarding the actual behavior of weight migration during scaling (e.g., changing EP world size).
Could you please clarify the following two points?
1. Is the weight copying “Physical”?
Despite the function being named _inplace (a suffix that often implies zero-copy or metadata-only changes), my understanding is that when new_global_expert_indices differs from old_global_expert_indices, actual heavy payload transfer must occur.
Does this trigger NCCL communication to move expert weights across GPUs?
Does this trigger local memory copies to reconstruct the expert_weights tensor on the local GPU?
I want to confirm that this is not just a logical index re-mapping, but a physical migration of the underlying Tensor data.
2. Is this intended for “Dynamic” Runtime Reconfiguration?
Is this function designed to be called on-the-fly (e.g., between inference steps or requests) to dynamically change the topology of the cluster without restarting the engine?
If so, I assume we need to account for the latency cost of moving these heavy weights.
I would appreciate it if you could confirm my understanding. Thanks!
Yes, the weight copying in rearrange_expert_weights_inplace is physical. When new_global_expert_indices differs from old_global_expert_indices, the function performs actual data movement: it copies expert weights between GPUs using NCCL-based P2P communication and within each GPU using local memory copies, not just logical remapping. Heavy payload transfer does occur across devices, and the local expert_weights tensors are overwritten with the new data. In short, the process involves both inter-GPU (NCCL) and intra-GPU (memory copy) operations.
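To make the "physical, not logical" distinction concrete, here is a minimal single-process sketch (using numpy in place of CUDA tensors, with made-up shapes and variable names, not the real implementation). It shows that after migration the destination buffer's storage is the same object, overwritten in place with the new expert's payload, rather than being re-pointed at another tensor:

```python
# Hypothetical single-process sketch: two "ranks" are simulated as numpy
# arrays; swapping two experts between them overwrites the existing buffers
# in place (a physical copy), leaving the storage addresses unchanged.
import numpy as np

hidden = 4
# Each "rank" holds 2 local expert slots of width `hidden`.
rank0 = np.arange(0 * hidden, 2 * hidden, dtype=np.float32).reshape(2, hidden)  # experts 0, 1
rank1 = np.arange(2 * hidden, 4 * hidden, dtype=np.float32).reshape(2, hidden)  # experts 2, 3

old_mapping = [0, 1, 2, 3]  # global expert id per physical slot (rank-major)
new_mapping = [0, 2, 1, 3]  # experts 1 and 2 swap ranks

addr_before = rank1.__array_interface__["data"][0]  # rank1's storage address

# "Migration": stage the payloads first (in real code this staging is done
# via NCCL isend/irecv into a scratch buffer), then copy into live buffers.
staged_for_rank0_slot1 = rank1[0].copy()  # expert 2's weights travel to rank 0
staged_for_rank1_slot0 = rank0[1].copy()  # expert 1's weights travel to rank 1
rank0[1, :] = staged_for_rank0_slot1
rank1[0, :] = staged_for_rank1_slot0

addr_after = rank1.__array_interface__["data"][0]
assert addr_before == addr_after  # same storage: updated in place, not remapped
```

The staging step is essential: copying through a buffer avoids reading a slot after it has already been overwritten, which is also why the real flow stages data before the copy-back.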
Yes, this function is intended for dynamic, runtime reconfiguration: it is designed to be called on the fly, for example during expert-parallel scaling or load rebalancing, without restarting the engine. You are also correct that it incurs nontrivial latency from physically transferring large tensors, and that cost should be weighed before triggering a reconfiguration.
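One way to account for that latency is a simple amortization check: only trigger a rebalance when the projected per-step savings outweigh the one-time migration cost. The sketch below is a hypothetical cost model (the function name and parameters are illustrative, not part of any real API):

```python
# Hypothetical amortization check: rebalancing pays off only when cumulative
# per-step savings exceed the one-time cost of physically moving the weights.
def should_rebalance(saving_ms_per_step: float,
                     migration_cost_ms: float,
                     steps_until_next_rebalance: int) -> bool:
    # Total projected savings over the window vs. the migration latency.
    return saving_ms_per_step * steps_until_next_rebalance > migration_cost_ms

# e.g. saving 0.5 ms/step over 10,000 steps amortizes a 200 ms migration,
# but over only 100 steps (50 ms saved) it does not.
```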
Would you like a step-by-step breakdown of the communication and memory copy logic?
Certainly! Here’s a step-by-step breakdown of how rearrange_expert_weights_inplace physically migrates expert weights during scaling or rebalancing:
Preparation: The function receives the old and new expert index mappings, the expert weights, and the expert-parallel process group. It determines which weights need to be moved across GPUs, copied locally, or left unchanged.
Local Copy: For experts that remain on the same GPU and position, their weights are copied locally within the GPU to a buffer; no inter-GPU communication is needed.
P2P Send/Recv Setup: For experts that move to a different GPU, the function determines which ranks (GPUs) need to send or receive each expert’s weights. It sets up NCCL P2P operations (isend/irecv) for these transfers.
Communication: All required P2P operations are batched and executed using batch_isend_irecv, which triggers actual NCCL-based data transfer between GPUs.
Buffer Copy-Back: After communication, the received weights in the buffer are copied back into the local expert_weights tensors, finalizing the migration.
Redundancy and Edge Cases: The logic also handles redundant experts, duplicating data locally as needed when multiple physical experts on the same rank represent the same logical expert.
This process ensures that after the function completes, each GPU holds the correct expert weights in the right positions, reflecting the new expert mapping. The operation is physical and can be triggered dynamically at runtime, but it incurs communication and memory copy overhead.
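The steps above can be sketched as a single-process simulation (numpy arrays stand in for per-rank CUDA tensors, and a staging dict stands in for both the local buffer and the NCCL isend/irecv transfers; `rearrange_sim` and its data layout are hypothetical, not the real implementation):

```python
# Single-process simulation of the breakdown above: classify each destination
# slot, stage every payload (the stand-in for local buffer copies and for
# simulated isend/irecv), then copy back into the live weight buffers in place.
import numpy as np

def rearrange_sim(weights, old_idx, new_idx):
    """weights: list of per-rank arrays, shape (slots, hidden).
    old_idx/new_idx: per-rank lists of global expert ids, one per slot."""
    # 1. Preparation: record a current owner slot for every global expert id
    #    (with redundant experts, any existing replica can serve as the source).
    owner = {}
    for r, ids in enumerate(old_idx):
        for s, e in enumerate(ids):
            owner.setdefault(e, (r, s))

    # 2-4. Stage every needed payload before any write, so sources are never
    #      read after being overwritten (mirrors the buffer-based real flow).
    staged = {}
    for r, ids in enumerate(new_idx):
        for s, e in enumerate(ids):
            src_r, src_s = owner[e]
            staged[(r, s)] = weights[src_r][src_s].copy()

    # 5. Copy-back: overwrite the live tensors in place with the staged data.
    for (r, s), payload in staged.items():
        weights[r][s][:] = payload
    return weights

# Usage: expert 2 becomes replicated on both ranks (the redundancy case).
w = [np.array([[0., 0.], [1., 1.]]), np.array([[2., 2.], [3., 3.]])]
old = [[0, 1], [2, 3]]
new = [[0, 2], [2, 3]]
rearrange_sim(w, old, new)
```

After the call, rank 0's second slot holds expert 2's weights while rank 1 keeps its own copy, matching the "each GPU ends up with the correct experts in the right positions" guarantee described above.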