I’d like to do simple serving with vllm, but without manually managing/launching multiple engines or handling CUDA_VISIBLE_DEVICES allocation myself.
Does vllm serve support a mode that can utilize all the available GPUs (e.g. dispatching incoming requests across multiple engines to minimize latency)? Would `--data-parallel-size 4 --tensor-parallel-size 2` achieve what I’m looking for (launch 4 engines, each using 2 GPUs), or does it do something else?
Basically I’m looking for a basic single-node, 8-GPU serving solution without having to spin up any load balancers / schedulers / relaunchers (in case anything crashes).
Yes, vllm serve now supports internal data parallelism and can utilize all available GPUs without manual engine management or CUDA_VISIBLE_DEVICES allocation. Using `--data-parallel-size 4 --tensor-parallel-size 2` will launch 4 data-parallel engines, each using 2 GPUs (via tensor parallelism), for a total of 8 GPUs. Incoming requests are automatically load-balanced across the engines to minimize latency, and vLLM manages the engine processes (including restarts if one crashes), so you do not need an external load balancer or relauncher. This is the recommended way to do simple, single-node, multi-GPU serving with vLLM[^1][^2][^3].
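For reference, a minimal sketch of what that launch could look like (the model name and port are placeholders, not taken from your setup):

```bash
# One server process managing 4 data-parallel engine replicas, each using
# 2-way tensor parallelism, for 8 GPUs total; adjust model and port as needed.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --data-parallel-size 4 \
    --tensor-parallel-size 2 \
    --port 8000
```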
This exposes a single API endpoint and internally handles request dispatch and engine management across all 8 GPUs. Would you like more detail on how the internal load balancing or failure recovery works?
I’m getting this log output when using `vllm serve` with `--data-parallel-size`:
INFO 08-01 12:18:31 [config.py:1869] Defaulting to use mp for distributed inference
INFO 08-01 12:18:31 [config.py:2112] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-01 12:18:31 [cuda.py:156] Data Parallel: Forcing enforce eager to be True since DP is currently not supported with CUDA Graphs.
Why are torch.compile / CUDA graphs not supported for this mode? Ideally vLLM should even cache / share the same compilation artifacts across all engines (same idea as PyTorch’s Mega-Cache), and surely the compilation artifacts / cache can be shared via the file system.
(I’m a human. Looking for human responses, thanks)
Currently, vLLM disables CUDA graphs and torch.compile when using data parallelism (`--data-parallel-size`) because CUDA graphs are not yet supported in this mode. This comes down to technical limitations: CUDA graphs require static execution and memory layouts, and since data-parallel engines run as separate processes, it is difficult to safely share or reuse compiled graphs and their memory across them. As a result, vLLM enforces eager mode for stability and correctness in DP setups[^1][^2].
Sharing compilation artifacts (like torch.compile graphs) across engines is a desired feature, but it’s non-trivial due to process isolation and device context differences. The vLLM team is aware of this limitation and may address it in future releases, but for now, eager mode is required for data parallel serving. Would you like more technical detail on the challenges or current roadmap?
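To illustrate the kind of file-system sharing you’re describing (this is the generic PyTorch Inductor mechanism, not something vLLM currently wires up for its DP engines):

```bash
# Sketch only, assuming PyTorch's standard Inductor cache environment variable:
# pointing every process at the same on-disk cache directory lets them reuse
# compilation artifacts written by one another. The path is a placeholder.
export TORCHINDUCTOR_CACHE_DIR=/shared/torchinductor-cache
```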
Understood, here’s a direct answer:
vLLM disables CUDA graphs and torch.compile in data parallel mode because each data-parallel engine is a separate process, and CUDA graphs (and their compiled artifacts) are tightly bound to the process and device context that created them. Sharing or reusing these graphs across processes is technically complex and not yet implemented, so to avoid subtle bugs or crashes, vLLM enforces eager mode for data parallel serving. This is a known limitation, and the team is considering improvements, but for now, it’s a tradeoff for reliability and simplicity in multi-process setups[^1][^2].
If you want maximum performance with CUDA graphs or torch.compile, use tensor parallelism only (i.e., `--tensor-parallel-size N`), which works within a single process group; see the sketch below. Data parallelism is best for maximizing throughput and fault tolerance, but currently at the cost of these optimizations. Let me know if you want more technical background or have other questions.
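For instance, a TP-only launch on the same 8-GPU node could look like this (model name and port are placeholders; CUDA graphs stay enabled because nothing forces `--enforce-eager`):

```bash
# Single engine spanning all 8 GPUs via tensor parallelism; CUDA graphs and
# torch.compile remain available in this configuration.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --port 8000
```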