We are benchmarking an external fan-out setup with multiple independent DP=1 vllm serve replicas, one per GPU, and comparing two orchestration strategies in Python:
- Process-based orchestration: main process → worker processes → vllm serve / vllm bench serve subprocesses
- Thread-based orchestration: main process → worker threads → vllm serve / vllm bench serve subprocesses
In this setup, the Python workers do not perform inference directly. They only launch and manage subprocesses, wait for readiness, run vllm bench serve, and aggregate results. Actual inference happens inside the vllm serve subprocesses.
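The thread-based control plane described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark harness: the worker threads only block on child-process I/O (which releases the GIL), and the real replica command would be something like `vllm serve MODEL --port PORT`, shown here only in a comment while the demo uses a trivial stand-in command.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def launch_replica(gpu_id: int, cmd: list[str]) -> subprocess.CompletedProcess:
    """Launch one replica subprocess pinned to a single GPU and wait for it.

    The thread spends its time blocked on the child process, so the GIL is
    released and many replicas can be managed from a single Python process.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


def run_all(num_replicas: int, make_cmd) -> list[subprocess.CompletedProcess]:
    # One thread per replica; threads are cheap and sufficient because the
    # control plane is I/O-bound (waiting on child processes), not CPU-bound.
    with ThreadPoolExecutor(max_workers=num_replicas) as pool:
        futures = [pool.submit(launch_replica, i, make_cmd(i))
                   for i in range(num_replicas)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Stand-in command for demonstration. In the real setup this would be
    # something like: ["vllm", "serve", MODEL, "--port", str(8000 + i)]
    results = run_all(2, lambda i: [sys.executable, "-c",
                                    f"print('replica {i} ok')"])
    for r in results:
        print(r.stdout.strip())
```

`launch_replica`, `run_all`, and `make_cmd` are illustrative names, not vLLM APIs; the point is only that the outer layer is plain subprocess management.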
What we observed is that thread-based orchestration significantly outperforms process-based orchestration. Our current hypothesis is that this is expected: the outer Python layer does mostly orchestration, I/O, and subprocess management rather than CPU-bound compute, so multiprocessing adds overhead (spawn, imports, IPC, queue serialization, process lifecycle) without much benefit.
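The hypothesized overhead can be sanity-checked with a rough timing sketch: submitting trivial tasks to a thread pool versus a process pool isolates the pool's own spawn/IPC cost, since the tasks themselves do no work. Absolute numbers vary by machine and start method; this is only meant to show the fixed cost of process-based fan-out.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def noop(i: int) -> int:
    """A task that does no work, so any measured time is pool overhead."""
    return i


def time_pool(pool_cls, n_tasks: int = 64, workers: int = 4) -> float:
    """Time creating a pool, running trivial tasks through it, and tearing it down."""
    start = time.perf_counter()
    with pool_cls(max_workers=workers) as pool:
        list(pool.map(noop, range(n_tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Thread pool: cheap thread creation, no serialization of tasks/results.
    # Process pool: process spawn/fork, module imports (under spawn), and
    # pickling of every argument and result across pipes.
    print(f"threads:   {time_pool(ThreadPoolExecutor) * 1e3:.1f} ms")
    print(f"processes: {time_pool(ProcessPoolExecutor) * 1e3:.1f} ms")
```

On typical machines the process pool takes at least an order of magnitude longer here, which is pure overhead when the workers only manage subprocesses anyway.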
Questions:
1. Does this align with the vLLM team’s expectations for external multi-replica orchestration?
2. Are there recommended best practices for launching multiple independent vllm serve instances from one controller process?
3. Would you recommend thread-based orchestration as the default control-plane approach in this type of setup?
4. Are there any known caveats around startup, cache sharing, or internal multiprocessing interactions that could distort benchmark results in this pattern?
Yes, your observations align with expectations: for external orchestration of multiple independent vllm serve (DP=1) replicas, thread-based orchestration is generally preferable to process-based orchestration, because the outer Python layer is I/O-bound and only manages subprocesses rather than performing compute. Multiprocessing adds overhead (process spawn, module imports, IPC, serialization) without benefit in this context, while threads are lighter and sufficient for launching subprocesses and aggregating results. This is consistent with community practice for similar setups, where threads or asyncio handle orchestration and processes are reserved for the actual inference workloads inside the vllm serve instances.
Best practices include: (1) using threads or asyncio in the controller for launching and monitoring vllm serve / vllm bench serve subprocesses; (2) pinning each vllm serve instance to a unique GPU (e.g., via CUDA_VISIBLE_DEVICES); (3) avoiding shared cache directories unless explicitly supported; and (4) remembering that vllm serve itself uses multiprocessing internally, so the outer orchestration should not interfere with its process tree. There are no known issues with this orchestration pattern distorting benchmark results, as long as the replicas are isolated and the controller itself does not become a bottleneck.
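Practices (2) and the readiness step can be sketched as follows. This is a minimal example, assuming the default vllm serve CLI and its /health endpoint; the function names, port scheme, and timeout values are illustrative, not part of vLLM's API.

```python
import os
import subprocess
import time
import urllib.request


def start_replica(gpu_id: int, port: int, model: str) -> subprocess.Popen:
    """Start one vllm serve replica pinned to a single GPU.

    CUDA_VISIBLE_DEVICES is set in the child's environment only, so each
    replica sees exactly one GPU and the replicas stay fully isolated.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    cmd = ["vllm", "serve", model, "--port", str(port)]
    return subprocess.Popen(cmd, env=env)


def wait_ready(port: int, timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Poll the server's /health endpoint until it responds or timeout expires.

    Benchmarking before the server is ready (model load, CUDA graph capture,
    warmup) is a common way to distort results in this pattern.
    """
    deadline = time.monotonic() + timeout
    url = f"http://localhost:{port}/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(interval)
    return False
```

Only starting vllm bench serve after wait_ready returns True keeps slow startup (which differs per GPU and per replica) out of the measured throughput.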
Would you like a step-by-step breakdown or code example?