class DPLBAsyncMPClient(DPAsyncMPClient):

    async def _scale_up_elastic_ep(self, cur_data_parallel_size: int,
                                   new_data_parallel_size: int) -> None:
        """Scale up the data parallel size by creating new engine cores
        and reconfiguring existing ones."""
        cur_data_parallel_size = len(self.core_engines)
        # Phase 1: Send reconfigure messages to all existing engines and
        # wait for them to be sent
        reconfig_futures = []
        self.vllm_config.parallel_config.data_parallel_master_port = \
            get_open_port()
        for engine in self.core_engines:
            reconfig_request = ReconfigureDistributedRequest(
                new_data_parallel_size=new_data_parallel_size,
                new_data_parallel_rank=ReconfigureRankType.KEEP_CURRENT_RANK,
                new_data_parallel_rank_local=\
                    ReconfigureRankType.KEEP_CURRENT_RANK,
                new_data_parallel_master_ip=self.vllm_config.parallel_config.
                data_parallel_master_ip,
                new_data_parallel_master_port=self.vllm_config.parallel_config.
                data_parallel_master_port)
            coro = self._call_utility_async("reinitialize_distributed",
                                            reconfig_request,
                                            engine=engine)
            reconfig_futures.append(asyncio.create_task(coro))
        logger.info("All reconfigure messages sent, starting engine creation")
        # Phase 2: Create new engines now that reconfig messages have been
        # sent. self.resources.engine_manager is guaranteed to be
        # CoreEngineActorManager for RayDPClient
        assert isinstance(self.resources.engine_manager,
                          CoreEngineActorManager)
        self.resources.engine_manager.scale_up_elastic_ep(
            self.vllm_config, new_data_parallel_size)
So Phase 1 lets the workers run EPLB, but the new engines and their placement groups are only created in Phase 2?
When scaling up with _scale_up_elastic_ep, Phase 1 sends reconfiguration messages to all existing engines, allowing them to update their distributed environment and run EPLB (Expert Parallel Load Balancer) to rebalance expert assignments for the new world size. In Phase 2, new engine cores and placement groups are created for the additional data parallel ranks; these new engines receive the necessary weights and configuration from the existing engines, so EPLB's rebalanced expert mapping is applied consistently across both old and new engines. This two-phase approach first updates the state on the existing engines and only then integrates the new engines into the cluster, avoiding inconsistencies in expert allocation or weight distribution during scale-up, as described in Elastic EP Milestone 2 and the core_client.py source.
Would you like a step-by-step breakdown of how EPLB and engine state synchronization are handled during scale-up?
Certainly! Here’s a step-by-step breakdown of how EPLB (Expert Parallel Load Balancer) and engine state synchronization are handled during scale-up in vLLM:
Phase 1: Reconfigure Existing Engines
The controller sends reconfiguration requests to all existing engine cores, instructing them to update their distributed environment to the new data parallel size.
Each engine updates its communicators and prepares for the new world size, but continues to serve requests.
EPLB is triggered to rebalance expert assignments and redistribute expert weights according to the new configuration. This ensures that the expert mapping is consistent and ready for the expanded cluster.
The engines may progressively reconfigure between forward steps, minimizing service interruption and allowing in-flight requests to complete (Elastic EP Milestone 2).
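As a toy illustration of this first phase (this is not vLLM's actual EPLB algorithm, which also weighs per-expert load statistics; a plain round-robin assignment stands in for it here), Phase 1 can be pictured as recomputing the expert-to-rank mapping for the new world size and noting which expert weights must move:

```python
# Toy sketch of expert rebalancing when the EP world size grows.
# NOT vLLM's real EPLB algorithm; it only shows that Phase 1 recomputes
# the expert -> rank mapping before any new engines exist.

def rebalance_experts(num_experts: int, world_size: int) -> dict:
    """Assign experts to ranks round-robin (stand-in for EPLB)."""
    return {expert: expert % world_size for expert in range(num_experts)}

old_mapping = rebalance_experts(16, 2)  # before scale-up: ranks 0-1
new_mapping = rebalance_experts(16, 4)  # after Phase 1: ranks 0-3

# Experts whose home rank changed need their weights moved in Phase 2.
moved = [e for e in range(16) if old_mapping[e] != new_mapping[e]]
print(moved)  # [2, 3, 6, 7, 10, 11, 14, 15]
```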
Phase 2: Create New Engines and Placement Groups
After all reconfiguration messages are sent, new engine cores are created for the additional data parallel ranks.
The new engines receive all necessary model weights (including both non-MoE and expert weights) from peer GPUs, ensuring they have the same expert mapping and state as the existing engines.
The system waits for the new engines to signal readiness before proceeding.
Once all engines (old and new) are ready, the coordinator finalizes the scale-up, and the updated EPLB mapping is now active cluster-wide (core_client.py source, Elastic EP Milestone 2).
This two-phase process ensures that EPLB’s expert rebalancing and all engine states are synchronized before and after new engines join, preventing inconsistencies or traffic drops.
@RunLLM just tell me when the real mapping is changed. What happens in Phase 1 when the mapping has been changed but the new engines and GPU weights have not yet been transferred? Use specific data to explain.
The real expert mapping is changed during Phase 1, when existing engines receive the reconfiguration command and EPLB runs to rebalance and update the expert-to-rank assignments. However, at this point, only the existing engines have the new mapping and weights; the new engines (and their GPUs) have not yet received any expert weights or mapping.
If you change the mapping in Phase 1 but the new engines have not yet been created or transferred weights, the new engines will initially have no valid expert weights or mapping. They will receive the correct expert weights and mapping only in Phase 2, when they are created and the system synchronizes their state with the existing engines (Elastic EP Milestone 2, core_client.py source). This ensures that, before the new engines start serving requests, they are fully consistent with the updated mapping.
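With made-up numbers (8 experts, DP size scaling from 2 to 4; not taken from any real config), the transition window looks like this:

```python
# Made-up numbers: scaling DP size 2 -> 4 with 8 experts.
new_dp_size = 4
new_mapping = {e: e % new_dp_size for e in range(8)}  # expert -> owner rank

active_ranks = {0, 1}   # engines that already exist and hold the new mapping
pending_ranks = {2, 3}  # created and given weights only in Phase 2

# Until Phase 2 completes, experts owned by pending ranks are unreachable.
unreachable = sorted(e for e, r in new_mapping.items() if r in pending_ranks)
print(unreachable)  # [2, 3, 6, 7]
```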
Would you like a concrete example with numbers to illustrate this process?
Existing requests continue to be served by the current engines using the old mapping until the new ranks are fully ready. The system allows existing workers to progressively reconfigure between forward steps, so if reconfiguration cannot continue (e.g., because new engines are not yet ready), the current engines keep serving requests without interruption (Elastic EP Milestone 2).
No requests are routed to the new ranks until they are fully initialized and synchronized; only after all engines are ready does the system switch to the new mapping and include the new ranks in request serving (Elastic EP Milestone 2).
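A minimal sketch of why this holds, assuming a hypothetical round-robin dispatcher over the active engine list (vLLM's real load balancing is more sophisticated, but the key property is the same: only engines present in the list can receive requests):

```python
# Hypothetical round-robin dispatcher; only engines in the list get traffic.

def dispatch(core_engines: list, request_id: int) -> str:
    return core_engines[request_id % len(core_engines)]

core_engines = ["engine-0", "engine-1"]  # during the transition
served = [dispatch(core_engines, i) for i in range(4)]
print(served)  # ['engine-0', 'engine-1', 'engine-0', 'engine-1']

# After readiness, the new engines are appended and become eligible.
core_engines += ["engine-2", "engine-3"]
served_after = [dispatch(core_engines, i) for i in range(4)]
print(served_after)  # ['engine-0', 'engine-1', 'engine-2', 'engine-3']
```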
Would you like more detail on the synchronization or request routing logic?
This guarantee is enforced by the scale-up state machine and synchronization logic: after sending reconfiguration messages, the system waits for all new engines to signal readiness before proceeding. Specifically, in _scale_up_elastic_ep, after creating new engines, the code enters a loop that waits for “ready” messages from all new engines on the input socket before moving forward. Only after all new engines have sent their initial message does the process continue to finalize the scale-up (core_client.py source).
Here is the relevant code excerpt:
    # Wait for ready messages from new engines on the input socket
    sync_input_socket = zmq.Socket.shadow(self.input_socket)
    while new_engine_identities:
        if not sync_input_socket.poll(timeout=600_000):
            raise TimeoutError(
                "Timed out waiting for new engines to send initial "
                "message on input socket.")
        identity, _ = sync_input_socket.recv_multipart()
        new_engine_identities.discard(identity)
This loop ensures that no further steps (including request routing to new ranks) occur until all new engines are fully initialized and have communicated readiness.
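The same handshake can be simulated without ZMQ, using a plain queue as a stand-in for the input socket (the 2-byte little-endian identities mirror how new engines are identified in core_client.py; the queue itself is only an illustration):

```python
# Stand-in for the readiness loop: a queue replaces the ZMQ socket so the
# sketch runs anywhere. Identities use the same 2-byte little-endian
# encoding as core_client.py.
import queue

new_engine_identities = {i.to_bytes(2, "little") for i in (2, 3)}
inbox = queue.Queue()

# New engines "send" their initial message as they come up.
inbox.put((3).to_bytes(2, "little"))
inbox.put((2).to_bytes(2, "little"))

while new_engine_identities:
    try:
        identity = inbox.get(timeout=600)  # mirrors poll(timeout=600_000)
    except queue.Empty:
        raise TimeoutError(
            "Timed out waiting for new engines to send initial message.")
    new_engine_identities.discard(identity)

print("all new engines ready")
```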
Correct: if the EPLB state (expert mapping) is changed on existing engines before new engines are ready, those new engines do not yet have the updated weights or mapping, so they cannot serve requests safely. However, vLLM’s scale-up logic ensures that only the existing engines serve requests during this transition—no requests are routed to new engines until they are fully initialized and synchronized (Elastic EP Milestone 2). Thus, the temporary mismatch does not affect correctness, because new engines are not in the serving pool until the mapping and weights are consistent everywhere.
Would you like more detail on how vLLM prevents new engines from serving requests prematurely?
@RunLLM vLLM’s scale-up logic ensures that only the existing engines serve requests during this transition—no requests are routed to new engines until they are fully initialized and synchronized (Elastic EP Milestone 2)
This is ensured because, during scale-up, the new engines are not added to the active engine list or request routing pool until they have sent their “ready” signal and the scale-up process is finalized. The main request scheduling and routing logic only considers engines present in self.core_engines, and the new engines are appended to this list only after they are fully initialized and have communicated readiness (core_client.py source). Until this point, only the original engines are available for request dispatch.
If you want to see the exact code, the following snippet shows that new engines are appended and only become eligible after readiness:
    for i in range(cur_data_parallel_size, new_data_parallel_size):
        new_engine = i.to_bytes(2, "little")
        self.core_engines.append(new_engine)
        new_engine_identities.add(new_engine)

    # Wait for ready messages from new engines on the input socket
    sync_input_socket = zmq.Socket.shadow(self.input_socket)
    while new_engine_identities:
        if not sync_input_socket.poll(timeout=600_000):
            raise TimeoutError(...)
        identity, _ = sync_input_socket.recv_multipart()
        new_engine_identities.discard(identity)
Request routing does not use these new engines until this loop completes, so no requests are sent to them prematurely.
No—during the transition, only the existing engines serve requests, and they use the updated expert mapping locally. Since new engines are not yet in the routing pool, no requests are sent to them, so there is no risk of requests being routed to an uninitialized or incorrect expert location (Elastic EP Milestone 2). The system ensures that all engines (old and new) have the same, updated mapping before the new engines begin serving.
Would you like a step-by-step example of how requests are handled during this window?
During the scale-up transition, the updated expert mapping may assign some experts to new ranks that do not yet exist. However, requests are only processed by the existing engines, and any expert assignments to new (not-yet-initialized) ranks are effectively ignored until those ranks are ready. The routing logic on existing engines will only dispatch to experts present on their own device or among currently active ranks, so no request is routed to a non-existent rank during this window (Elastic EP Milestone 2).
If an expert is mapped to a new rank that is not yet available, the existing engines will not attempt to send data to that rank until it is fully initialized and synchronized. This prevents routing errors or failures during the transition.
You are correct: if the new expert mapping assigns some experts exclusively to new (not-yet-initialized) ranks, then during the transition, existing engines may not have access to those experts. This can result in missing experts or degraded service until all new ranks are fully initialized and the mapping is consistent cluster-wide. This is a known limitation: the system relies on the mapping and weight transfer to complete before requests requiring those experts are routed, and during the transition, some requests may be delayed or served with degraded capacity (Elastic EP Milestone 2).
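A sketch of this edge case, assuming hypothetically that EPLB keeps redundant replicas of some experts (a made-up placement, not a real configuration): an expert is lost during the window only if every one of its replicas lives on a not-yet-ready rank.

```python
# Edge-case sketch with hypothetical redundant expert replicas.
# An expert is unavailable during the window only if ALL of its replicas
# live on ranks that are still initializing.
replicas = {  # expert -> set of ranks holding a copy (made-up placement)
    0: {0}, 1: {1}, 2: {0, 2}, 3: {1, 3}, 4: {2}, 5: {3},
}
active_ranks = {0, 1}  # new ranks 2 and 3 are still initializing

unavailable = sorted(e for e, r in replicas.items() if not (r & active_ranks))
print(unavailable)  # [4, 5]: only unreplicated experts on new ranks are lost
```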
Would you like more detail on how to mitigate this or how vLLM handles such edge cases?