| Topic | Replies | Views | Activity |
|---|---:|---:|---|
| DeepSeek-V3: tool_choice="auto" not working, but tool_choice="required" is working | 4 | 648 | October 13, 2025 |
| Can we reuse CUDA graphs across layers? | 2 | 65 | October 9, 2025 |
| MCP tool-server OpenAI Responses API | 3 | 798 | September 25, 2025 |
| Pass instructions to Qwen Embedding / Reranker via OpenAI-compatible server? | 5 | 603 | September 11, 2025 |
| Is FCFS Scheduling Holding Back vLLM's Performance in Production? | 3 | 168 | September 11, 2025 |
| General questions on structured output backend | 9 | 738 | September 3, 2025 |
| Clarification: Does vLLM support concurrent decoding with multiple LoRA adapters in online inference? | 1 | 393 | August 29, 2025 |
| How to do KV cache transfer between a CPU instance and a GPU instance? | 1 | 217 | July 31, 2025 |
| Support for Deploying a 4-bit Fine-Tuned Model with LoRA on vLLM | 13 | 707 | July 30, 2025 |
| Does vLLM support tp>1 for the draft model when using speculative decoding? | 1 | 136 | July 29, 2025 |
| Is there any roadmap to support prefix caching on DRAM and disk? | 1 | 109 | July 25, 2025 |
| Performance Degradation and Compatibility Issues with AWQ Quantization in vLLM (Qwen2.5-VL-32B) | 1 | 483 | July 23, 2025 |
| Multi-node K8s GPU pooling | 3 | 364 | July 17, 2025 |
| Error trying to handle streaming tool call | 3 | 401 | July 17, 2025 |
| Improving Speculative Decoding for Beginning Tokens & Structured Output | 1 | 133 | July 16, 2025 |
| Question: Specifying the Medusa Choice Tree in vLLM | 1 | 89 | July 11, 2025 |
| Disagg Prefill timeout | 1 | 106 | July 7, 2025 |
| MoE quantization | 9 | 1182 | July 2, 2025 |
| Why are CUDA graph capture sizes limited by max_num_seqs? | 1 | 710 | June 29, 2025 |
| Scheduler in vLLM | 1 | 297 | June 26, 2025 |
| prompt_embeds usage in the vLLM OpenAI completion API | 4 | 159 | June 17, 2025 |
| Is there a detailed introduction to the two W8A8 quantization methods? | 1 | 185 | June 15, 2025 |
| Sequence Parallelism Support: Source Code Location | 0 | 40 | June 10, 2025 |
| Something weird about the reading procedure of q_vecs in the paged attention kernel | 3 | 24 | June 9, 2025 |
| Computation time remains consistent across chunks in chunked prefill despite linearly growing attention complexity? | 1 | 56 | June 2, 2025 |
| Why does computation time remain consistent across chunks in chunked prefill despite linearly growing attention complexity? | 3 | 93 | June 2, 2025 |
| APC slowdown with block-size=1 | 1 | 73 | May 26, 2025 |
| RunBot's math-to-text on NVIDIA NeMo Framework AutoModel | 11 | 97 | May 19, 2025 |
| Issue with Dynamic YaRN and key-value cache reuse in vLLM | 1 | 130 | May 18, 2025 |
| VUA: library code for LLM inference engines for external storage of KV caches | 1 | 78 | May 13, 2025 |