| Topic | Replies | Views | Activity |
|---|---|---|---|
| About the Features category | 0 | 28 | March 20, 2025 |
| Is it possible to configure the order of the pipeline in multi-node deployments? | 3 | 7 | October 16, 2025 |
| Question on Advanced vLLM Use Case: Distributed Prefix Caching for a CAG Evaluation Framework | 1 | 11 | October 15, 2025 |
| A bit of frustration with Quantization | 5 | 67 | October 14, 2025 |
| DeepSeek-V3 tool_choice="auto" not working, but tool_choice="required" is working | 4 | 304 | October 13, 2025 |
| Can we reuse CUDA graph across layers? | 2 | 14 | October 9, 2025 |
| MCP tool-server OpenAI responses API | 3 | 122 | September 25, 2025 |
| Pass instructions to Qwen Embedding / Reranker via OpenAI-compatible server? | 5 | 145 | September 11, 2025 |
| Is FCFS Scheduling Holding Back vLLM's Performance in Production? | 3 | 51 | September 11, 2025 |
| General questions on structured output backend | 9 | 164 | September 3, 2025 |
| Clarification: Does vLLM support concurrent decoding with multiple LoRA adapters in online inference? | 1 | 110 | August 29, 2025 |
| Deployment example for a Qwen3 model with hybrid thinking | 7 | 269 | August 26, 2025 |
| When using large batches, the Ray service crashes: ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read | 40 | 497 | August 7, 2025 |
| How to do KV cache transfer between a CPU instance and a GPU instance? | 1 | 104 | July 31, 2025 |
| Support for Deploying 4-bit Fine-Tuned Model with LoRA on vLLM | 13 | 203 | July 30, 2025 |
| Does vLLM support a draft model with tp>1 when using speculative decoding? | 1 | 60 | July 29, 2025 |
| Is there any roadmap to support prefix caching on DRAM and disk? | 1 | 67 | July 25, 2025 |
| Performance Degradation and Compatibility Issues with AWQ Quantization in vLLM (Qwen2.5-VL-32B) | 1 | 238 | July 23, 2025 |
| Multi-node K8s GPU pooling | 3 | 124 | July 17, 2025 |
| Error trying to handle streaming tool call | 3 | 178 | July 17, 2025 |
| Improving Speculative Decoding for Beginning Tokens & Structured Output | 1 | 70 | July 16, 2025 |
| Question: Specifying Medusa Choice Tree in vLLM | 1 | 36 | July 11, 2025 |
| Disagg Prefill timeout | 1 | 54 | July 7, 2025 |
| MoE quantization | 9 | 743 | July 2, 2025 |
| Why is CUDA graph capture sizes limited by max_num_seqs? | 1 | 337 | June 29, 2025 |
| Scheduler in vLLM | 1 | 148 | June 26, 2025 |
| prompt_embeds usage in the vLLM OpenAI completion API | 4 | 70 | June 17, 2025 |
| Is there a detailed introduction to the two W8A8 quantization methods? | 1 | 94 | June 15, 2025 |
| Sequence Parallelism Support - Source Code Location | 0 | 25 | June 10, 2025 |
| Something weird about the reading procedure of q_vecs in the page attention kernel | 3 | 13 | June 9, 2025 |