How is the KV cache used in a multi-GPU setup with data parallelism? Is it the same as with tensor parallelism, where the VRAM left over on both GPUs after loading the model weights is available for the KV cache?
If I run two separate 32GB 5090s behind a load balancer, is the KV cache effectively doubled, with each GPU holding its own cache? And with tp=2, is all the available VRAM used for one copy of the KV cache?
In a data parallel (DP) setup, each GPU runs a full replica of the model and maintains its own independent KV cache. If you run two separate 32GB GPUs behind a load balancer (DP=2), each GPU has its own KV cache, so total KV cache capacity is effectively doubled, but each request only uses the cache on the GPU it lands on; there is no sharing between GPUs. With tensor parallelism (TP=2), the model weights are split across both GPUs and the remaining VRAM on both is pooled into a single sharded KV cache (each GPU holds the KV entries for its slice of the attention heads), so every request can draw on the combined memory for longer contexts or more concurrent sequences.
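As a rough illustration of the capacity difference, here is a back-of-the-envelope sketch. The model size, per-token KV footprint, and overhead figures are assumed numbers for a hypothetical ~8B fp16 model, not measurements:

```python
# Rough KV cache capacity for 2x 32 GB GPUs (all sizes are assumptions).
GPU_VRAM_GB = 32
NUM_GPUS = 2
MODEL_WEIGHTS_GB = 16                        # ~8B model in fp16 (assumption)
OVERHEAD_GB = 2                              # activations, CUDA context, etc. (rough guess)
KV_GB_PER_TOKEN = 128 * 1024 / 1024**3       # ~128 KiB per token (assumption)

# Data parallel: each GPU holds the full weights; caches are independent.
dp_cache_per_gpu = GPU_VRAM_GB - MODEL_WEIGHTS_GB - OVERHEAD_GB
dp_tokens_per_gpu = dp_cache_per_gpu / KV_GB_PER_TOKEN
print(f"DP: {dp_cache_per_gpu} GB cache per GPU, "
      f"~{dp_tokens_per_gpu:,.0f} tokens per replica (not shared)")

# Tensor parallel: weights and KV cache are both sharded across the GPUs.
tp_cache_total = NUM_GPUS * (GPU_VRAM_GB - MODEL_WEIGHTS_GB / NUM_GPUS - OVERHEAD_GB)
tp_tokens_total = tp_cache_total / KV_GB_PER_TOKEN
print(f"TP: {tp_cache_total} GB pooled cache, "
      f"~{tp_tokens_total:,.0f} tokens shared by all requests")
```

The point is that TP frees the VRAM that DP spends on the second copy of the weights and pools the remainder, whereas in DP each replica's free memory is only reachable by the requests routed to that GPU.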
To summarize: DP gives each GPU its own KV cache (no sharing, but total capacity doubles across independent requests), while TP shards both the model weights and the KV cache across GPUs (one larger, shared cache). For more detail, see the vLLM docs on Data Parallel Deployment and on Optimization and Tuning.
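For concreteness, here is a minimal launch sketch for both layouts using the vLLM serve CLI, wrapped in Python only to keep the two variants side by side. The model name and ports are placeholders; adjust memory utilization and other flags to your hardware:

```python
# Sketch: two DP replicas vs one TP=2 instance (model name and ports are placeholders).
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

def launch_dp_replicas():
    """Data parallel: one full vLLM instance per GPU; put a load balancer in front."""
    procs = []
    for gpu_id, port in [(0, 8000), (1, 8001)]:
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        procs.append(subprocess.Popen(
            ["vllm", "serve", MODEL,
             "--port", str(port),
             "--gpu-memory-utilization", "0.90"],
            env=env,
        ))
    return procs  # each replica keeps its own, independent KV cache

def launch_tp_instance():
    """Tensor parallel: one instance that shards weights and KV cache over both GPUs."""
    return subprocess.Popen(
        ["vllm", "serve", MODEL,
         "--port", "8000",
         "--tensor-parallel-size", "2",
         "--gpu-memory-utilization", "0.90"],
    )
```

With the DP layout you point an ordinary HTTP load balancer at ports 8000 and 8001; the Data Parallel Deployment page referenced above also describes vLLM's built-in data-parallel mode as an alternative to running the replicas yourself.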
Would you like a step-by-step breakdown or example launch commands?