@RunLLM, for LLM queries that come in as requests, does vLLM store/cache the context between requests?
vLLM does not automatically store or maintain conversation history (context) for each request. The client application is responsible for tracking and providing the full context (e.g., previous messages) with each query. vLLM can cache the KV cache for shared prefixes if you enable Automatic Prefix Caching (--enable-prefix-caching), which speeds up repeated or overlapping contexts, but it does not persist or manage user conversation history itself. The cache is stored in GPU memory, not on disk, and is used to avoid recomputation for identical prefixes across requests, not to store arbitrary request context or chat history for users.
For multi-turn conversations, you must send the entire conversation history with each request; vLLM formats it into a prompt using the model's chat template. To benefit from prefix caching, make sure repeated prefixes appear verbatim in your requests so vLLM can reuse the cached computation and reduce latency for overlapping contexts. See the sources below for details.
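As a rough sketch of the client-side pattern (the base URL and model name are placeholders, and the server is assumed to have been started with --enable-prefix-caching): the client resends the full history each turn, and the shared prefix lets vLLM reuse cached KV computation.
from openai import OpenAI

# The client, not vLLM, owns the conversation history.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

history = [{"role": "user", "content": "What is the capital of France?"}]
reply = client.chat.completions.create(model="your-model-name", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Turn 2 resends the whole history; with prefix caching enabled, the
# server can reuse the KV cache for the shared prefix from turn 1.
history.append({"role": "user", "content": "What is its population?"})
reply = client.chat.completions.create(model="your-model-name", messages=history)
print(reply.choices[0].message.content)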
Would you like more detail on how to structure requests or how prefix caching works?
Sources:
- https://github.com/vllm-project/vllm/issues/2766
- https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/
- https://github.com/vllm-project/vllm/issues/5796
@RunLLM, Can vLLM accept multiple requests in a single HTTP request?
Yes, vLLM's OpenAI-compatible Completions API supports batched prompts, so you can send multiple prompts in a single HTTP request for completion generation. The Chat Completions API, however, accepts only one chat conversation per request; batching is not supported for chat-style requests.
Would you like more detail or example code for batched completions?
@RunLLM Give me an example of a batched request to the completions endpoint.
You can send multiple prompts in a single HTTP request to the /v1/completions endpoint by passing a list of strings to the prompt field. For example:
from openai import OpenAI

# The API key is unused by vLLM but required by the client library.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Passing a list of strings to `prompt` batches both prompts into
# a single HTTP request.
completion = client.completions.create(
    model="your-model-name",
    prompt=["A robot may not injure a human being", "My name is"],
    max_tokens=20,
    echo=False,
    stream=False,
)
print(completion)
This will return completions for both prompts in one response.
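As a small follow-up (reusing the completion object from the snippet above): each choice in the response carries an index matching the position of its prompt, so you can pair results back up like this.
# choices[i].index corresponds to the position of the prompt in the batch.
for choice in completion.choices:
    print(choice.index, repr(choice.text))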
Would you like more explanation or a curl example?
@RunLLM What is the tool-call-parser for the model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 when served with vLLM?
For Qwen/Qwen3-VL-30B-A3B-Instruct-FP8, the recommended tool-call-parser when serving with vLLM is hermes. Use the flag --tool-call-parser hermes when starting the server, as documented in the official vLLM tool calling guide.
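As a minimal launch sketch (default host and port assumed), note that automatic tool choice also needs its own flag alongside the parser:
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes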
Would you like more detail on tool calling setup for this model?