class DPLBAsyncMPClient(DPAsyncMPClient):

    async def _scale_up_elastic_ep(self, cur_data_parallel_size: int,
                                   new_data_parallel_size: int) -> None:
        """Scale up the data parallel size by creating new engine cores
        and reconfiguring existing ones."""
        cur_data_parallel_size = len(self.core_engines)
        # Phase 1: Send reconfigure messages to all existing engines and
        # wait for them to be sent
        reconfig_futures = []
        self.vllm_config.parallel_config.data_parallel_master_port = \
            get_open_port()
        for engine in self.core_engines:
            reconfig_request = ReconfigureDistributedRequest(
                new_data_parallel_size=new_data_parallel_size,
                new_data_parallel_rank=ReconfigureRankType.KEEP_CURRENT_RANK,
                new_data_parallel_rank_local=\
                    ReconfigureRankType.KEEP_CURRENT_RANK,
                new_data_parallel_master_ip=self.vllm_config.parallel_config.
                data_parallel_master_ip,
                new_data_parallel_master_port=self.vllm_config.parallel_config.
                data_parallel_master_port)
            coro = self._call_utility_async("reinitialize_distributed",
                                            reconfig_request,
                                            engine=engine)
            reconfig_futures.append(asyncio.create_task(coro))
        logger.info("All reconfigure messages sent, starting engine creation")
        # Phase 2: Create new engines now that reconfig messages have been
        # sent. self.resources.engine_manager is guaranteed to be
        # CoreEngineActorManager for RayDPClient
        assert isinstance(self.resources.engine_manager,
                          CoreEngineActorManager)
        self.resources.engine_manager.scale_up_elastic_ep(
            self.vllm_config, new_data_parallel_size)
So Phase 1 lets the workers run EPLB, but the new engines and their placement groups are only created in Phase 2?
When scaling up with _scale_up_elastic_ep, Phase 1 sends reconfiguration messages to all existing engines, allowing them to update their distributed environment and run EPLB (Expert Parallel Load Balancer) to rebalance expert assignments for the new world size. In Phase 2, new engine cores and placement groups are created for the additional data parallel ranks; these new engines receive the necessary weights and configuration from the existing engines, so EPLB's rebalanced expert mapping is applied consistently across both old and new engines. This two-phase approach first updates the state on the existing engines and only then integrates the new engines into the cluster, avoiding inconsistencies in expert allocation or weight distribution during scale-up, as described in Elastic EP Milestone 2 and the core_client.py source.
Would you like a step-by-step breakdown of how EPLB and engine state synchronization are handled during scale-up?
Certainly! Here’s a step-by-step breakdown of how EPLB (Expert Parallel Load Balancer) and engine state synchronization are handled during scale-up in vLLM:
Phase 1: Reconfigure Existing Engines
The controller sends reconfiguration requests to all existing engine cores, instructing them to update their distributed environment to the new data parallel size.
Each engine updates its communicators and prepares for the new world size, but continues to serve requests.
EPLB is triggered to rebalance expert assignments and redistribute expert weights according to the new configuration. This ensures that the expert mapping is consistent and ready for the expanded cluster.
The engines may progressively reconfigure between forward steps, minimizing service interruption and allowing in-flight requests to complete (Elastic EP Milestone 2).
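As a toy illustration of this first phase (this is not vLLM's actual EPLB algorithm, which also weighs per-expert load statistics; a plain round-robin assignment stands in for it here), Phase 1 can be pictured as recomputing the expert-to-rank mapping for the new world size and noting which expert weights must move:

```python
# Toy sketch of expert rebalancing when the EP world size grows.
# NOT vLLM's real EPLB algorithm; it only shows that Phase 1 recomputes
# the expert -> rank mapping before any new engines exist.

def rebalance_experts(num_experts: int, world_size: int) -> dict:
    """Assign experts to ranks round-robin (stand-in for EPLB)."""
    return {expert: expert % world_size for expert in range(num_experts)}

old_mapping = rebalance_experts(16, 2)  # before scale-up: ranks 0-1
new_mapping = rebalance_experts(16, 4)  # after Phase 1: ranks 0-3

# Experts whose home rank changed need their weights moved in Phase 2.
moved = [e for e in range(16) if old_mapping[e] != new_mapping[e]]
print(moved)  # [2, 3, 6, 7, 10, 11, 14, 15]
```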
Phase 2: Create New Engines and Placement Groups
After all reconfiguration messages are sent, new engine cores are created for the additional data parallel ranks.
The new engines receive all necessary model weights (including both non-MoE and expert weights) from peer GPUs, ensuring they have the same expert mapping and state as the existing engines.
The system waits for the new engines to signal readiness before proceeding.
Once all engines (old and new) are ready, the coordinator finalizes the scale-up, and the updated EPLB mapping is now active cluster-wide (core_client.py source, Elastic EP Milestone 2).
This two-phase process ensures that EPLB’s expert rebalancing and all engine states are synchronized before and after new engines join, preventing inconsistencies or traffic drops.
@RunLLM just tell me when the real mapping is changed. What happens in Phase 1 when the mapping has been changed but the new engines and GPU weights have not yet been transferred? Use specific data to explain.
The real expert mapping is changed during Phase 1, when existing engines receive the reconfiguration command and EPLB runs to rebalance and update the expert-to-rank assignments. However, at this point, only the existing engines have the new mapping and weights; the new engines (and their GPUs) have not yet received any expert weights or mapping.
If you change the mapping in Phase 1 but the new engines have not yet been created or transferred weights, the new engines will initially have no valid expert weights or mapping. They will receive the correct expert weights and mapping only in Phase 2, when they are created and the system synchronizes their state with the existing engines (Elastic EP Milestone 2, core_client.py source). This ensures that, before the new engines start serving requests, they are fully consistent with the updated mapping.
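With made-up numbers (8 experts, DP size scaling from 2 to 4; not taken from any real config), the transition window looks like this:

```python
# Made-up numbers: scaling DP size 2 -> 4 with 8 experts.
new_dp_size = 4
new_mapping = {e: e % new_dp_size for e in range(8)}  # expert -> owner rank

active_ranks = {0, 1}   # engines that already exist and hold the new mapping
pending_ranks = {2, 3}  # created and given weights only in Phase 2

# Until Phase 2 completes, experts owned by pending ranks are unreachable.
unreachable = sorted(e for e, r in new_mapping.items() if r in pending_ranks)
print(unreachable)  # [2, 3, 6, 7]
```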
Would you like a concrete example with numbers to illustrate this process?
Existing requests continue to be served by the current engines using the old mapping until the new ranks are fully ready. The system allows existing workers to progressively reconfigure between forward steps, so if reconfiguration cannot continue (e.g., because new engines are not yet ready), the current engines keep serving requests without interruption (Elastic EP Milestone 2).
No requests are routed to the new ranks until they are fully initialized and synchronized; only after all engines are ready does the system switch to the new mapping and include the new ranks in request serving (Elastic EP Milestone 2).
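A minimal sketch of why this holds, assuming a hypothetical round-robin dispatcher over the active engine list (vLLM's real load balancing is more sophisticated, but the key property is the same: only engines present in the list can receive requests):

```python
# Hypothetical round-robin dispatcher; only engines in the list get traffic.

def dispatch(core_engines: list, request_id: int) -> str:
    return core_engines[request_id % len(core_engines)]

core_engines = ["engine-0", "engine-1"]  # during the transition
served = [dispatch(core_engines, i) for i in range(4)]
print(served)  # ['engine-0', 'engine-1', 'engine-0', 'engine-1']

# After readiness, the new engines are appended and become eligible.
core_engines += ["engine-2", "engine-3"]
served_after = [dispatch(core_engines, i) for i in range(4)]
print(served_after)  # ['engine-0', 'engine-1', 'engine-2', 'engine-3']
```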
Would you like more detail on the synchronization or request routing logic?
This guarantee is enforced by the scale-up state machine and synchronization logic: after sending reconfiguration messages, the system waits for all new engines to signal readiness before proceeding. Specifically, in _scale_up_elastic_ep, after creating new engines, the code enters a loop that waits for “ready” messages from all new engines on the input socket before moving forward. Only after all new engines have sent their initial message does the process continue to finalize the scale-up (core_client.py source).
Here is the relevant code excerpt:
    # Wait for ready messages from new engines on the input socket
    sync_input_socket = zmq.Socket.shadow(self.input_socket)
    while new_engine_identities:
        if not sync_input_socket.poll(timeout=600_000):
            raise TimeoutError(
                "Timed out waiting for new engines to send initial "
                "message on input socket.")
        identity, _ = sync_input_socket.recv_multipart()
        new_engine_identities.discard(identity)
This loop ensures that no further steps (including request routing to new ranks) occur until all new engines are fully initialized and have communicated readiness.
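The same handshake can be simulated without ZMQ, using a plain queue as a stand-in for the input socket (the 2-byte little-endian identities mirror how new engines are identified in core_client.py; the queue itself is only an illustration):

```python
# Stand-in for the readiness loop: a queue replaces the ZMQ socket so the
# sketch runs anywhere. Identities use the same 2-byte little-endian
# encoding as core_client.py.
import queue

new_engine_identities = {i.to_bytes(2, "little") for i in (2, 3)}
inbox = queue.Queue()

# New engines "send" their initial message as they come up.
inbox.put((3).to_bytes(2, "little"))
inbox.put((2).to_bytes(2, "little"))

while new_engine_identities:
    try:
        identity = inbox.get(timeout=600)  # mirrors poll(timeout=600_000)
    except queue.Empty:
        raise TimeoutError(
            "Timed out waiting for new engines to send initial message.")
    new_engine_identities.discard(identity)

print("all new engines ready")
```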
Correct: if the EPLB state (expert mapping) is changed on existing engines before new engines are ready, those new engines do not yet have the updated weights or mapping, so they cannot serve requests safely. However, vLLM’s scale-up logic ensures that only the existing engines serve requests during this transition—no requests are routed to new engines until they are fully initialized and synchronized (Elastic EP Milestone 2). Thus, the temporary mismatch does not affect correctness, because new engines are not in the serving pool until the mapping and weights are consistent everywhere.
Would you like more detail on how vLLM prevents new engines from serving requests prematurely?
@RunLLM vLLM’s scale-up logic ensures that only the existing engines serve requests during this transition—no requests are routed to new engines until they are fully initialized and synchronized (Elastic EP Milestone 2)
This is ensured because, during scale-up, the new engines are not added to the active engine list or request routing pool until they have sent their “ready” signal and the scale-up process is finalized. The main request scheduling and routing logic only considers engines present in self.core_engines, and the new engines are appended to this list only after they are fully initialized and have communicated readiness (core_client.py source). Until this point, only the original engines are available for request dispatch.
If you want to see the exact code, the following snippet shows that new engines are appended and only become eligible after readiness:
    for i in range(cur_data_parallel_size, new_data_parallel_size):
        new_engine = i.to_bytes(2, "little")
        self.core_engines.append(new_engine)
        new_engine_identities.add(new_engine)

    # Wait for ready messages from new engines on the input socket
    sync_input_socket = zmq.Socket.shadow(self.input_socket)
    while new_engine_identities:
        if not sync_input_socket.poll(timeout=600_000):
            raise TimeoutError(...)
        identity, _ = sync_input_socket.recv_multipart()
        new_engine_identities.discard(identity)
Request routing does not use these new engines until this loop completes, so no requests are sent to them prematurely.
No—during the transition, only the existing engines serve requests, and they use the updated expert mapping locally. Since new engines are not yet in the routing pool, no requests are sent to them, so there is no risk of requests being routed to an uninitialized or incorrect expert location (Elastic EP Milestone 2). The system ensures that all engines (old and new) have the same, updated mapping before the new engines begin serving.
Would you like a step-by-step example of how requests are handled during this window?
During the scale-up transition, the updated expert mapping may assign some experts to new ranks that do not yet exist. However, requests are only processed by the existing engines, and any expert assignments to new (not-yet-initialized) ranks are effectively ignored until those ranks are ready. The routing logic on existing engines will only dispatch to experts present on their own device or among currently active ranks, so no request is routed to a non-existent rank during this window (Elastic EP Milestone 2).
If an expert is mapped to a new rank that is not yet available, the existing engines will not attempt to send data to that rank until it is fully initialized and synchronized. This prevents routing errors or failures during the transition.
You are correct: if the new expert mapping assigns some experts exclusively to new (not-yet-initialized) ranks, then during the transition, existing engines may not have access to those experts. This can result in missing experts or degraded service until all new ranks are fully initialized and the mapping is consistent cluster-wide. This is a known limitation: the system relies on the mapping and weight transfer to complete before requests requiring those experts are routed, and during the transition, some requests may be delayed or served with degraded capacity (Elastic EP Milestone 2).
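A sketch of this edge case, assuming hypothetically that EPLB keeps redundant replicas of some experts (a made-up placement, not a real configuration): an expert is lost during the window only if every one of its replicas lives on a not-yet-ready rank.

```python
# Edge-case sketch with hypothetical redundant expert replicas.
# An expert is unavailable during the window only if ALL of its replicas
# live on ranks that are still initializing.
replicas = {  # expert -> set of ranks holding a copy (made-up placement)
    0: {0}, 1: {1}, 2: {0, 2}, 3: {1, 3}, 4: {2}, 5: {3},
}
active_ranks = {0, 1}  # new ranks 2 and 3 are still initializing

unavailable = sorted(e for e, r in replicas.items() if not (r & active_ranks))
print(unavailable)  # [4, 5]: only unreplicated experts on new ranks are lost
```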
Would you like more detail on how to mitigate this or how vLLM handles such edge cases?