Data parallel or tensor parallel

Hi,

I have 2x 32GB GPUs. Currently they are in data parallel: both are separate servers behind a haproxy load balancer, and both serve 12B Gemma 3 with a 2048 context length and 64 concurrent users.
If I put these in the same machine and use tensor-parallel = 2, the model weights are then split between the GPUs, so there will be more memory left in each GPU for KV cache? Or is the KV cache duplicated on each GPU? So which is better, data parallel or tensor parallel?

If you use tensor parallelism (tensor_parallel_size=2), the model weights are split across both GPUs, so each GPU holds half the weights and has more memory left for KV cache. The KV cache is also sharded across the GPUs (not duplicated), so a single instance can serve longer contexts or more concurrent users. With data parallelism (separate servers), each GPU holds a full copy of the model and its own independent KV cache; memory is not pooled, so per-request context length is limited to what one GPU can hold, but throughput scales roughly linearly as you load-balance requests across servers.
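For concreteness, here is a minimal sketch of both setups using vLLM's offline Python API (the same options exist as flags on the server CLI). The model id and memory settings are illustrative examples, not recommendations:

```python
from vllm import LLM

# Tensor parallel: one engine spans both GPUs. Weights are split in half
# and the KV cache is sharded across the GPUs as well.
llm_tp = LLM(
    model="google/gemma-3-12b-it",   # example model id; substitute your own
    tensor_parallel_size=2,          # split weights (and KV cache) over 2 GPUs
    max_model_len=2048,
    gpu_memory_utilization=0.90,
)

# Data parallel: two independent single-GPU engines, in practice two separate
# server processes (one per GPU) behind haproxy. Each process holds a full
# copy of the weights and its own KV cache.
# llm_dp_0 = LLM(model="google/gemma-3-12b-it", tensor_parallel_size=1)
# llm_dp_1 = LLM(model="google/gemma-3-12b-it", tensor_parallel_size=1)
```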

For a single large model where you want to maximize context length or concurrency per instance, tensor parallelism is the better fit. For maximizing total throughput over many independent requests with a model that already fits on one GPU, data parallelism (multiple servers) is often more efficient because it avoids inter-GPU communication overhead. For your 12B model on 2x 32GB GPUs, tensor parallelism will allow higher per-instance concurrency and context length, while data parallelism will maximize aggregate throughput if each instance already meets your latency and context requirements. See: Parallelism and Scaling, Optimization and Tuning, Data Parallel Deployment.
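A rough back-of-the-envelope memory budget shows where the KV-cache headroom comes from (assuming bf16 weights and ignoring activation buffers and framework overhead, which typically take a few extra GB):

```python
# Back-of-the-envelope memory budget; real numbers depend on dtype,
# quantization, activation buffers, and vLLM's own overhead.
params = 12e9                  # ~12B parameters
bytes_per_param = 2            # bf16/fp16
weight_gb = params * bytes_per_param / 1e9     # ~24 GB of weights total

gpu_gb = 32

# Data parallel: every GPU holds the full weights.
dp_headroom_per_gpu = gpu_gb - weight_gb       # ~8 GB left per GPU

# Tensor parallel (tp=2): each GPU holds half the weights.
tp_headroom_per_gpu = gpu_gb - weight_gb / 2   # ~20 GB left per GPU

print(f"DP headroom per GPU: ~{dp_headroom_per_gpu:.0f} GB")
print(f"TP headroom per GPU: ~{tp_headroom_per_gpu:.0f} GB")
```

Because tensor parallelism stores the weights only once across the pair, the aggregate KV-cache pool is larger too; the trade-off is the all-reduce communication on every forward pass.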

Would you like a step-by-step breakdown or code examples for configuring either approach?


Thanks, some other AI was saying the KV cache is replicated on each GPU in tensor parallel, so good to know it was wrong.

Correct: when using tensor parallelism in vLLM, the KV cache is sharded across GPUs, not replicated. Each GPU stores the keys and values only for the attention heads assigned to it, which is why tensor parallelism frees up more memory for longer context or higher concurrency than data parallelism, where each GPU holds a full copy of both the model and the KV cache (Optimization and Tuning, Conserving Memory).
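As a sketch of what that sharding means for memory: with tensor_parallel_size=2, each GPU keeps the K/V tensors for roughly half of the KV heads, for every token and every layer. The architecture numbers below are purely hypothetical placeholders (not Gemma 3's real config), just to show the arithmetic:

```python
# Per-token KV cache size under tensor parallelism, sharded by KV heads.
# All architecture numbers here are hypothetical placeholders.
num_layers   = 48
num_kv_heads = 8
head_dim     = 128
dtype_bytes  = 2        # fp16/bf16 KV cache
tp_size      = 2

# Factor of 2 accounts for keys and values.
kv_per_token_total   = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_per_token_per_gpu = kv_per_token_total // tp_size

print(f"KV cache per token (whole model):   {kv_per_token_total / 1024:.0f} KiB")
print(f"KV cache per token per GPU (tp=2):  {kv_per_token_per_gpu / 1024:.0f} KiB")
```

In other words, every token's KV entries still exist across the pair, but each GPU only holds its share of the heads, so the per-GPU KV footprint shrinks along with the per-GPU weight footprint.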

Would you like more technical details or references on how KV cache sharding works in vLLM?
