I want to use 8x 5090s (256GB VRAM) to deploy Qwen3-235B-FP8. According to this test, it seems like I'd be able to use up to a 64k context length in 256GB of VRAM. But I wonder: if I run higher concurrency with long contexts (no shared prefix), say 8x requests at 32k context, will throughput drop very fast? Would it be better to use 8x 4090 48GB cards (384GB VRAM) so I have more space for the KV cache?
My goal is to get the highest throughput I can at concurrency ≤ 8. I'm stuck on this decision because in single-request tests the 5090 is nearly 30% faster than the 4090. Someone help me please.
It seems likely that for high concurrency with long, unique contexts (no shared prefix), throughput will be limited by available KV cache memory, not just raw GPU speed. While a single 5090 is faster than a 4090, using 8x4090s (384GB total VRAM) gives you more total KV cache space than 8x5090s (256GB total VRAM), which may allow you to serve more concurrent long-context requests before hitting cache limits and seeing throughput drop sharply. This is because vLLM’s throughput at high concurrency is often bottlenecked by KV cache space, especially for long contexts without prefix sharing, as described in this discussion and here.
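To see why the KV cache fills up so fast, here is a minimal per-token sizing sketch. The attention shape used (94 layers, 4 KV heads, head dim 128) is my reading of the Qwen3-235B-A22B config and should be verified against the model's config.json; dtype_bytes is 2 for a default fp16 KV cache, 1 if you enable fp8 KV-cache quantization.

```python
# Rough KV-cache sizing for unique (non-shared-prefix) contexts.
# Assumed attention shape for Qwen3-235B-A22B: 94 layers, 4 KV heads,
# head_dim 128 -- verify against the model's config.json.

def kv_bytes_per_token(layers=94, kv_heads=4, head_dim=128, dtype_bytes=2):
    # K and V each store [kv_heads, head_dim] values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_gib(num_requests, ctx_len, dtype_bytes):
    return num_requests * ctx_len * kv_bytes_per_token(dtype_bytes=dtype_bytes) / 2**30

for label, nbytes in [("fp16 KV", 2), ("fp8 KV", 1)]:
    print(f"1 x 64k ctx, {label}: {kv_gib(1, 64 * 1024, nbytes):5.1f} GiB")
    print(f"8 x 32k ctx, {label}: {kv_gib(8, 32 * 1024, nbytes):5.1f} GiB")
```

With these assumptions, 8 concurrent 32k-token requests need on the order of 47 GiB of KV cache at fp16 (roughly half that at fp8), on top of the model weights.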
My understanding is that, for your goal (max throughput at concurrency ≤ 8 with long contexts), the 8x4090 setup may be preferable despite the lower per-GPU speed, since it provides more aggregate VRAM for KV cache, which is critical for high-concurrency, long-context workloads. Would you like a step-by-step breakdown of how KV cache usage and throughput scale with these hardware choices?
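As a very rough first pass at that breakdown (all figures are assumptions, not measurements: FP8 weights at about 1 byte per parameter, a gpu_memory_utilization-style cap of 0.95, and the same attention shape as in the sketch above), you can compare the KV-cache budget each setup leaves after loading the weights:

```python
# Back-of-the-envelope KV-cache budget comparison (assumptions, not
# measurements): FP8 weights ~1 byte/param (~235 GB), 0.95 memory cap;
# activations, CUDA graphs, and fragmentation eat into the rest too.

APPROX_WEIGHTS_GB = 235
KV_BYTES_PER_TOKEN = 2 * 94 * 4 * 128 * 2   # fp16 KV, same shape as above

def kv_budget_gb(num_gpus, vram_per_gpu_gb, util=0.95):
    return num_gpus * vram_per_gpu_gb * util - APPROX_WEIGHTS_GB

need_gb = 8 * 32 * 1024 * KV_BYTES_PER_TOKEN / 1e9
print(f"Needed for 8 x 32k ctx (fp16 KV): ~{need_gb:.0f} GB")
print(f"8 x 5090 (32GB each) KV budget:  ~{kv_budget_gb(8, 32):.0f} GB")
print(f"8 x 4090 (48GB each) KV budget:  ~{kv_budget_gb(8, 48):.0f} GB")
```

If those assumptions are even roughly right, the 256GB setup leaves only single-digit GB of headroom for KV cache, while the 384GB setup leaves well over 100 GB, which is why the 4090 option tends to win once you push multiple concurrent long contexts, even though each individual GPU is slower.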
Sources: