I want to use 8x 5090s (256GB VRAM) to deploy Qwen3-235B-FP8. According to this test, it seems like I'd be able to use up to a 64k context length in 256GB of VRAM. But I wonder: if I run higher concurrency with long contexts (no shared prefix), say 8x requests at 32k context, will throughput drop very fast? Would it be better to use 8x 4090 48GB cards (384GB VRAM) so I have more space for the KV cache?
My goal is to get the highest throughput I can at concurrency ≤ 8. I'm stuck on this decision because in single-request tests the 5090 is nearly 30% faster than the 4090. Someone help me please.
It seems likely that for high concurrency with long, unique contexts (no shared prefix), throughput will be limited by available KV cache memory, not just raw GPU speed. While a single 5090 is faster than a 4090, using 8x4090s (384GB total VRAM) gives you more total KV cache space than 8x5090s (256GB total VRAM), which may allow you to serve more concurrent long-context requests before hitting cache limits and seeing throughput drop sharply. This is because vLLM’s throughput at high concurrency is often bottlenecked by KV cache space, especially for long contexts without prefix sharing, as described in this discussion and here.
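To see why the KV cache fills up so fast, here is a minimal per-token sizing sketch. The attention shape used (94 layers, 4 KV heads, head dim 128) is my reading of the Qwen3-235B-A22B config and should be verified against the model's config.json; dtype_bytes is 2 for a default fp16 KV cache, 1 if you enable fp8 KV-cache quantization.

```python
# Rough KV-cache sizing for unique (non-shared-prefix) contexts.
# Assumed attention shape for Qwen3-235B-A22B: 94 layers, 4 KV heads,
# head_dim 128 -- verify against the model's config.json.

def kv_bytes_per_token(layers=94, kv_heads=4, head_dim=128, dtype_bytes=2):
    # K and V each store [kv_heads, head_dim] values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_gib(num_requests, ctx_len, dtype_bytes):
    return num_requests * ctx_len * kv_bytes_per_token(dtype_bytes=dtype_bytes) / 2**30

for label, nbytes in [("fp16 KV", 2), ("fp8 KV", 1)]:
    print(f"1 x 64k ctx, {label}: {kv_gib(1, 64 * 1024, nbytes):5.1f} GiB")
    print(f"8 x 32k ctx, {label}: {kv_gib(8, 32 * 1024, nbytes):5.1f} GiB")
```

With these assumptions, 8 concurrent 32k-token requests need on the order of 47 GiB of KV cache at fp16 (roughly half that at fp8), on top of the model weights.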
My understanding is that, for your goal (max throughput at concurrency ≤ 8 with long contexts), the 8x4090 setup may be preferable despite the lower per-GPU speed, since it provides more aggregate VRAM for KV cache, which is critical for high-concurrency, long-context workloads. Would you like a step-by-step breakdown of how KV cache usage and throughput scale with these hardware choices?
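As a very rough first pass at that breakdown (all figures are assumptions, not measurements: FP8 weights at about 1 byte per parameter, a gpu_memory_utilization-style cap of 0.95, and the same attention shape as in the sketch above), you can compare the KV-cache budget each setup leaves after loading the weights:

```python
# Back-of-the-envelope KV-cache budget comparison (assumptions, not
# measurements): FP8 weights ~1 byte/param (~235 GB), 0.95 memory cap;
# activations, CUDA graphs, and fragmentation eat into the rest too.

APPROX_WEIGHTS_GB = 235
KV_BYTES_PER_TOKEN = 2 * 94 * 4 * 128 * 2   # fp16 KV, same shape as above

def kv_budget_gb(num_gpus, vram_per_gpu_gb, util=0.95):
    return num_gpus * vram_per_gpu_gb * util - APPROX_WEIGHTS_GB

need_gb = 8 * 32 * 1024 * KV_BYTES_PER_TOKEN / 1e9
print(f"Needed for 8 x 32k ctx (fp16 KV): ~{need_gb:.0f} GB")
print(f"8 x 5090 (32GB each) KV budget:  ~{kv_budget_gb(8, 32):.0f} GB")
print(f"8 x 4090 (48GB each) KV budget:  ~{kv_budget_gb(8, 48):.0f} GB")
```

If those assumptions are even roughly right, the 256GB setup leaves only single-digit GB of headroom for KV cache, while the 384GB setup leaves well over 100 GB, which is why the 4090 option tends to win once you push multiple concurrent long contexts, even though each individual GPU is slower.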
Sources: