@RunLLM, for LLM queries that come in as requests, does vLLM store/cache the context between requests?
vLLM does not automatically store or maintain conversation history (context) for each request. The client application is responsible for tracking and providing the full context (e.g., previous messages) with each query. vLLM can cache the KV cache for shared prefixes if you enable Automatic Prefix Caching (--enable-prefix-caching), which speeds up repeated or overlapping contexts, but it does not persist or manage user conversation history itself. The cache is stored in GPU memory, not on disk, and is used to avoid recomputation for identical prefixes across requests, not to store arbitrary request context or chat history for users.
For multi-turn conversations, you must send the entire conversation history with each request; vLLM formats it into a prompt using the model's chat template. To benefit from prefix caching, make sure repeated prefixes appear verbatim in your requests so vLLM can reuse the cached computation and reduce latency for overlapping contexts. See the sources below for details.
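As a rough sketch of the client-side pattern (the base URL and model name are placeholders, and the server is assumed to have been started with --enable-prefix-caching): the client resends the full history each turn, and the shared prefix lets vLLM reuse cached KV computation.
from openai import OpenAI

# The client, not vLLM, owns the conversation history.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

history = [{"role": "user", "content": "What is the capital of France?"}]
reply = client.chat.completions.create(model="your-model-name", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Turn 2 resends the whole history; with prefix caching enabled, the
# server can reuse the KV cache for the shared prefix from turn 1.
history.append({"role": "user", "content": "What is its population?"})
reply = client.chat.completions.create(model="your-model-name", messages=history)
print(reply.choices[0].message.content)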
Would you like more detail on how to structure requests or how prefix caching works?
Sources:
- https://github.com/vllm-project/vllm/issues/2766
- https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/
- https://github.com/vllm-project/vllm/issues/5796
@RunLLM, Can vLLM accept multiple requests in a single HTTP request?
Yes, vLLM's OpenAI-compatible Completions API supports batched prompts, so you can send multiple prompts in a single HTTP request for completion generation. The Chat Completions API, however, accepts only one chat conversation per request; batching is not supported for chat-style requests.
Would you like more detail or example code for batched completions?
@RunLLM Give me an example of a batched request to the completions endpoint.
You can send multiple prompts in a single HTTP request to the /v1/completions endpoint by passing a list of strings to the prompt field. For example:
from openai import OpenAI

# The API key is unused by vLLM but required by the client library.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Passing a list of strings to `prompt` batches both prompts into
# a single HTTP request.
completion = client.completions.create(
    model="your-model-name",
    prompt=["A robot may not injure a human being", "My name is"],
    max_tokens=20,
    echo=False,
    stream=False,
)
print(completion)
This will return completions for both prompts in one response.
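As a small follow-up (reusing the completion object from the snippet above): each choice in the response carries an index matching the position of its prompt, so you can pair results back up like this.
# choices[i].index corresponds to the position of the prompt in the batch.
for choice in completion.choices:
    print(choice.index, repr(choice.text))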
Would you like more explanation or a curl example?
@RunLLM What is the tool-call-parser for the model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 when served with vLLM?
For Qwen/Qwen3-VL-30B-A3B-Instruct-FP8, the recommended tool-call-parser when serving with vLLM is hermes. Use the flag --tool-call-parser hermes when starting the server, as documented in the official vLLM tool calling guide.
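As a minimal launch sketch (default host and port assumed), note that automatic tool choice also needs its own flag alongside the parser:
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes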
Would you like more detail on tool calling setup for this model?