What is the recommended max tensor parallel value if the GPUs are in the same node? Does going past some number of GPUs cause slowness, etc.?
The recommended tensor parallel size is the number of GPUs in your node, but only if the model is actually large enough to need splitting across all of them. For small or medium models, increasing tensor_parallel_size beyond what is needed can reduce performance because of the extra inter-GPU communication; using more GPUs than necessary often slows inference down rather than speeding it up. For large models that cannot fit on a single GPU, set tensor_parallel_size to the minimum number of GPUs required to fit the model, and only increase it further if you need more KV cache memory or throughput, keeping in mind that communication overhead grows and returns diminish as you add GPUs. This is especially true if the GPUs are not connected via high-speed interconnects like NVLink or InfiniBand; PCIe-only setups see more overhead at higher tensor parallel sizes. For small models, data parallelism (multiple single-GPU instances) is often more efficient than high tensor parallelism.
In summary: use the smallest tensor_parallel_size that fits your model and only scale up if needed for memory or throughput, monitoring for performance drops as you add more GPUs. See Optimization and Tuning, Distributed Inference and Serving, and Performance Tuning for details.
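For example, a minimal offline sketch with vLLM's Python API (the model name and the tensor parallel size of 4 are placeholders standing in for a model that genuinely needs several GPUs):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: model name and tensor_parallel_size are placeholders.
# Use the smallest tensor_parallel_size that fits the weights plus some
# KV-cache headroom; raising it further mostly adds communication overhead.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example of a model too large for one GPU
    tensor_parallel_size=4,                     # e.g. 4 GPUs in one node
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```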
Would you like a step-by-step example or more detail on how to choose the value for your specific model and hardware?
So if the goal is to get more tokens/s and faster inference, is data parallel better if the model fits on half of the GPUs?
Yes, if your model fits on a single GPU (or on half your GPUs), running multiple independent vLLM instances (data parallelism) is generally better for maximizing tokens/s and throughput than using tensor parallelism across all GPUs. Tensor parallelism introduces inter-GPU communication overhead, which can slow down inference, especially for small or medium models. Data parallelism allows each GPU to process separate batches independently, scaling throughput nearly linearly with the number of GPUs used, as long as each GPU has a full model copy and enough memory for the batch and KV cache.
This approach is recommended in the vLLM docs and user discussions: use data parallelism (multiple model replicas, each on one GPU) for higher throughput when the model fits on a single GPU, and reserve tensor parallelism for models too large for a single GPU. See Optimization and Tuning, Tensor Parallelism vs Data Parallelism, and Data-parallel inference.
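If you want to verify this on your own hardware, a rough tokens/s probe (model name, prompt count, and token budget are placeholders) could look like the sketch below; run it once per tensor parallel setting, and for the data-parallel case run one copy per GPU in parallel and sum the rates:

```python
import time
from vllm import LLM, SamplingParams

# Rough throughput probe. Run once with tensor_parallel_size=1 (one instance
# per GPU, i.e. the data-parallel case) and once with =2, then compare.
# Model name, prompt count, and token budget are illustrative placeholders.
llm = LLM(model="google/gemma-3-4b-it", tensor_parallel_size=1)

prompts = ["Write a haiku about GPUs."] * 64
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```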
Would you like a code example or more detail on how to set up data parallel inference?
Yes, how do I serve data parallel on 2 7900 XTX cards?
I noticed that running gemma3 with tensor parallel 1 is much slower than tensor parallel 2. How would data parallel do?
To serve data parallel inference on 2 AMD 7900 XTX GPUs, run two separate vLLM server instances, each with --tensor-parallel-size 1 and each bound to a different GPU (using CUDA_VISIBLE_DEVICES or the ROCm equivalent). Then, use a load balancer to distribute requests between the two servers. This way, each GPU runs an independent copy of the model, maximizing throughput. Data parallelism in vLLM is achieved by running multiple independent servers, not by a built-in flag for AMD GPUs as of now.
If you run with --tensor-parallel-size 2, the model is split across both GPUs, which can be faster for large models that don’t fit on one GPU, but for models that fit on a single GPU, data parallel (multiple servers, each on one GPU) is usually faster. See Data-parallel inference and Tensor Parallelism vs Data Parallelism.
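A minimal launcher sketch (the gemma model name, ports, and device indices are placeholders; on ROCm, HIP_VISIBLE_DEVICES is the usual equivalent of CUDA_VISIBLE_DEVICES):

```python
import os
import subprocess

MODEL = "google/gemma-3-4b-it"  # placeholder; use the gemma3 variant you are running

# One vLLM server per GPU: each process sees only its own device and gets its own port.
procs = []
for gpu, port in [(0, 8000), (1, 8001)]:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # or env["HIP_VISIBLE_DEVICES"] on ROCm
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL,
         "--tensor-parallel-size", "1",
         "--port", str(port)],
        env=env,
    ))

# Point a load balancer (e.g. HAProxy or nginx) at localhost:8000 and localhost:8001.
for p in procs:
    p.wait()
```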
Would you like a step-by-step code example for launching and load balancing two vLLM servers?
Okay, so with a 55 GB model, getting more than 4x 24 GB cards won't increase throughput much in tensor parallel?
Correct—using more than 4×24GB GPUs for a 55GB model with tensor parallelism will not significantly increase throughput and may even reduce it due to increased inter-GPU communication overhead. Once the model fits across 3 or 4 GPUs, adding more GPUs mainly adds communication cost without much benefit, especially for inference workloads. For best throughput, use the minimum number of GPUs needed to fit the model, and consider data parallelism (multiple replicas) if you want to scale further for serving more requests in parallel. See Tensor Parallelism vs Data Parallelism and Optimization and Tuning.
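As a back-of-the-envelope check (the 90% usable-memory factor is just an assumption to leave room for runtime overhead):

```python
import math

# Rough minimum tensor_parallel_size needed just to hold the weights.
# Real deployments also need headroom for KV cache and activations, so only
# ~90% of each GPU's memory is assumed usable here (an assumption, not a rule).
model_size_gb = 55
gpu_mem_gb = 24
usable_fraction = 0.9

min_tp = math.ceil(model_size_gb / (gpu_mem_gb * usable_fraction))
print(f"Minimum GPUs just to hold the weights: {min_tp}")  # -> 3

# In practice 4 is a common choice here, both for KV cache headroom and because
# the tensor parallel size must also divide the model's attention head count.
```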
Would you like more detail or a setup example?
Will tensor parallel work with an odd number of cards?
Tensor parallelism in vLLM only works if the number of attention heads and other key model dimensions are divisible by the tensor parallel size (number of GPUs). If your model’s number of attention heads is not divisible by the number of GPUs (e.g., 3 or 5), tensor parallel will not work and vLLM will raise an error. This is a fundamental limitation of the current implementation. See vLLM Issue #596, vLLM Issue #1208, and vLLM Issue #5003.
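A quick pre-flight check, sketched with a placeholder model name and assuming the config exposes num_attention_heads directly (some multimodal models nest it in a text sub-config):

```python
from transformers import AutoConfig

def check_tp_size(model_name: str, tp_size: int) -> None:
    """Warn if the attention head count is not divisible by tensor_parallel_size."""
    config = AutoConfig.from_pretrained(model_name)
    heads = config.num_attention_heads  # assumes a flat text-model config
    if heads % tp_size:
        print(f"{model_name}: {heads} heads, tp={tp_size} -> not divisible, vLLM will refuse")
    else:
        print(f"{model_name}: {heads} heads, tp={tp_size} -> OK")

# Placeholder model name; try your own model and intended GPU counts.
for tp in (2, 3, 4):
    check_tp_size("Qwen/Qwen2.5-7B-Instruct", tp)
```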
Would you like more detail or possible workarounds?
Yes, what is the best way to set up data parallelism with HAProxy? Does it need sticky cookies?
It seems likely that the best way to set up data parallelism with HAProxy is to run a separate vLLM server on each GPU (or node), then use HAProxy as a reverse proxy to load balance requests across these servers. Sticky (session) cookies are not strictly required for stateless inference requests, but if you want to maximize cache hits (e.g., for prefix caching or conversational context), enabling sticky sessions (using cookies or source IP) can help route repeated requests from the same client to the same backend. See Data-parallel inference and Tensor Parallelism vs Data Parallelism.
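If you do enable cookie-based stickiness (HAProxy's cookie insert mode), the client has to send the cookie back for it to have any effect. A small client-side sketch (the proxy URL, port, and model name are placeholders) using a cookie-persisting requests.Session:

```python
import requests

# Assumes HAProxy listens on :8080 in front of the two vLLM servers and is
# configured to insert a server-identifying cookie (cookie-based stickiness).
# Reusing one Session keeps that cookie, so repeated requests from this client
# land on the same backend and can benefit from its prefix/KV cache.
PROXY = "http://localhost:8080/v1/chat/completions"  # placeholder URL

session = requests.Session()
payload = {
    "model": "google/gemma-3-4b-it",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32,
}

for _ in range(3):
    r = session.post(PROXY, json=payload, timeout=60)
    r.raise_for_status()
    print(r.json()["choices"][0]["message"]["content"])
```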
Would you like a sample HAProxy configuration or more detail on sticky session setup?