I’m a master’s student (High Performance Computing / Information Retrieval) researching a framework for evaluating Cache-Augmented Generation (CAG) models, and I have a question about the feasibility of an architecture built on vLLM.
My idea is to:
1 - Deploy vLLM with tensor parallelism across multiple nodes to create a single, distributed inference engine.
2 - Use a large, shared document (e.g., a technical manual, then scale up to find a limit) as a common context. I want to process this document once to “warm up” the distributed KV cache, effectively creating a prefix cache of this shared context.
3 - Serve subsequent, short queries that all relate to this cached context, leveraging the pre-computed KV cache state to avoid redundant processing.
My goal is then to use this setup to measure both system performance (throughput, latency) and IR quality (faithfulness, correctness, etc.) for CAG workloads. A rough sketch of the query pattern I have in mind is below.
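For concreteness, here is roughly the pattern I’m picturing, written against vLLM’s offline Python API on a single node first (the model name, file path, and parallel size are placeholders, not my actual setup):

```python
from vllm import LLM, SamplingParams
import time

# Single-node sketch; I would scale tensor_parallel_size up once this works.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,
    enable_prefix_caching=True,                # Automatic Prefix Caching
)

document = open("manual.txt").read()           # the large shared context (placeholder path)

# Step 2: warm up the cache by prefilling the shared document once.
llm.generate([document], SamplingParams(max_tokens=1, temperature=0.0))

# Step 3: short queries that all share the document as an exact prefix,
# so their prefill should reuse the cached KV blocks.
queries = [
    "\n\nQ: What does section 3.2 cover?\nA:",
    "\n\nQ: How do I reset the device?\nA:",
]
t0 = time.perf_counter()
outputs = llm.generate(
    [document + q for q in queries],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(f"{len(queries)} queries in {time.perf_counter() - t0:.2f}s")
for out in outputs:
    print(out.outputs[0].text)
```

The idea would be to keep this prompt-construction pattern unchanged while scaling the engine itself out across nodes.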
Does this align with the intended use cases for vLLM’s distributed features and prefix caching? I’m essentially looking for a gut check from experts on whether this architecture is sound and what potential pitfalls I should be aware of.
Yes, your proposed architecture aligns well with vLLM’s intended use cases. vLLM supports distributed inference using tensor parallelism across multiple nodes, and its Automatic Prefix Caching (APC) feature is specifically designed to cache the KV cache for shared prefixes, allowing subsequent queries with the same prefix (e.g., a large document) to reuse the pre-computed KV cache and avoid redundant computation. This setup is well suited to evaluating Cache-Augmented Generation (CAG) workloads, as it enables high-throughput, low-latency serving for queries that share a common context, such as a technical manual.

For distributed deployment, you can combine tensor and pipeline parallelism to scale across nodes, and APC will handle prefix reuse as long as the queries share the exact prefix and are routed to the same vLLM instance or data-parallel group. See Parallelism and Scaling and Automatic Prefix Caching for details.
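As a minimal, hedged sketch (placeholder model name and parallel sizes; multi-node execution additionally requires a Ray cluster spanning the nodes), the engine configuration would look roughly like this:

```python
from vllm import LLM

# Illustrative sizes for a 2-node x 4-GPU setup; adjust to your cluster.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,                     # shard each layer across the GPUs of a node
    pipeline_parallel_size=2,                   # split layers across nodes
    enable_prefix_caching=True,                 # APC (enabled by default in recent releases)
    distributed_executor_backend="ray",         # multi-node runs go through Ray
)
```

The same options are available as flags on the OpenAI-compatible server (--tensor-parallel-size, --pipeline-parallel-size, --enable-prefix-caching) if you would rather run your latency and throughput measurements against vllm serve.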
Potential pitfalls:
1 - Prefix caching is only effective if the query’s prefix exactly matches the cached context; even small differences will prevent cache reuse.
2 - The cache is not shared across independent vLLM deployments or data-parallel groups, so all relevant queries must be routed to the same group.
3 - The size of the shared context is limited by the GPU memory available for the KV cache, so extremely large documents may require careful resource planning.
4 - APC only speeds up the prefill phase; it does not reduce the time spent generating new tokens.
For more, see the APC documentation and the distributed serving guide. Would you like a step-by-step breakdown or code examples for setting this up?