Data parallel or tensor parallel

Hi,

I have 2x 32GB GPUs. Currently they are in data parallel: both are separate servers behind a haproxy load balancer, and both serve 12B Gemma 3 with a 2048 context length and 64 concurrent users.
If I put these in the same machine and use tensor-parallel = 2, the model weights are then split between the GPUs, so there will be more memory left in each GPU for KV cache? Or is the KV cache duplicated on each GPU? So which is better, data parallel or tensor parallel?

If you use tensor parallelism (tensor_parallel_size=2), the model weights are split across both GPUs, so each GPU holds half the weights and has more memory left for KV cache. The KV cache is also sharded across the GPUs (not duplicated), so a single instance can serve longer contexts or more concurrent users. With data parallelism (separate servers), each GPU holds a full copy of the model and its own independent KV cache; memory is not pooled, so per-request context length is limited to what one GPU can hold, but throughput scales roughly linearly as you load-balance requests across servers.
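For concreteness, here is a minimal sketch of both setups using vLLM's offline Python API (the same options exist as flags on the server CLI). The model id and memory settings are illustrative examples, not recommendations:

```python
from vllm import LLM

# Tensor parallel: one engine spans both GPUs. Weights are split in half
# and the KV cache is sharded across the GPUs as well.
llm_tp = LLM(
    model="google/gemma-3-12b-it",   # example model id; substitute your own
    tensor_parallel_size=2,          # split weights (and KV cache) over 2 GPUs
    max_model_len=2048,
    gpu_memory_utilization=0.90,
)

# Data parallel: two independent single-GPU engines, in practice two separate
# server processes (one per GPU) behind haproxy. Each process holds a full
# copy of the weights and its own KV cache.
# llm_dp_0 = LLM(model="google/gemma-3-12b-it", tensor_parallel_size=1)
# llm_dp_1 = LLM(model="google/gemma-3-12b-it", tensor_parallel_size=1)
```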

For a single large model where you want to maximize context length or concurrency per instance, tensor parallelism is the better fit. For maximizing total throughput over many independent requests with a model that already fits on one GPU, data parallelism (multiple servers) is often more efficient because it avoids inter-GPU communication overhead. For your 12B model on 2x 32GB GPUs, tensor parallelism will allow higher per-instance concurrency and context length, while data parallelism will maximize aggregate throughput if each instance already meets your latency and context requirements. See: Parallelism and Scaling, Optimization and Tuning, Data Parallel Deployment.
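A rough back-of-the-envelope memory budget shows where the KV-cache headroom comes from (assuming bf16 weights and ignoring activation buffers and framework overhead, which typically take a few extra GB):

```python
# Back-of-the-envelope memory budget; real numbers depend on dtype,
# quantization, activation buffers, and vLLM's own overhead.
params = 12e9                  # ~12B parameters
bytes_per_param = 2            # bf16/fp16
weight_gb = params * bytes_per_param / 1e9     # ~24 GB of weights total

gpu_gb = 32

# Data parallel: every GPU holds the full weights.
dp_headroom_per_gpu = gpu_gb - weight_gb       # ~8 GB left per GPU

# Tensor parallel (tp=2): each GPU holds half the weights.
tp_headroom_per_gpu = gpu_gb - weight_gb / 2   # ~20 GB left per GPU

print(f"DP headroom per GPU: ~{dp_headroom_per_gpu:.0f} GB")
print(f"TP headroom per GPU: ~{tp_headroom_per_gpu:.0f} GB")
```

Because tensor parallelism stores the weights only once across the pair, the aggregate KV-cache pool is larger too; the trade-off is the all-reduce communication on every forward pass.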

Would you like a step-by-step breakdown or code examples for configuring either approach?


Thanks, some other AI was saying the KV cache is replicated on each GPU in tensor parallel, so good to know it was wrong.

Correct: when using tensor parallelism in vLLM, the KV cache is sharded across GPUs, not replicated. Each GPU stores the keys and values only for the attention heads assigned to it, which is why tensor parallelism frees up more memory for longer context or higher concurrency than data parallelism, where each GPU holds a full copy of both the model and the KV cache (Optimization and Tuning, Conserving Memory).
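As a sketch of what that sharding means for memory: with tensor_parallel_size=2, each GPU keeps the K/V tensors for roughly half of the KV heads, for every token and every layer. The architecture numbers below are purely hypothetical placeholders (not Gemma 3's real config), just to show the arithmetic:

```python
# Per-token KV cache size under tensor parallelism, sharded by KV heads.
# All architecture numbers here are hypothetical placeholders.
num_layers   = 48
num_kv_heads = 8
head_dim     = 128
dtype_bytes  = 2        # fp16/bf16 KV cache
tp_size      = 2

# Factor of 2 accounts for keys and values.
kv_per_token_total   = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_per_token_per_gpu = kv_per_token_total // tp_size

print(f"KV cache per token (whole model):   {kv_per_token_total / 1024:.0f} KiB")
print(f"KV cache per token per GPU (tp=2):  {kv_per_token_per_gpu / 1024:.0f} KiB")
```

In other words, every token's KV entries still exist across the pair, but each GPU only holds its share of the heads, so the per-GPU KV footprint shrinks along with the per-GPU weight footprint.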

Would you like more technical details or references on how KV cache sharding works in vLLM?
