| Topic | Replies | Views | Activity |
|---|---:|---:|---|
| DeepSeek-V3: tool_choice="auto" not working, but tool_choice="required" is working | 4 | 648 | October 13, 2025 |
| Can we reuse CUDA graphs across layers? | 2 | 65 | October 9, 2025 |
| MCP tool-server OpenAI Responses API | 3 | 798 | September 25, 2025 |
| Pass instructions to Qwen Embedding / Reranker via OpenAI-compatible server? | 5 | 603 | September 11, 2025 |
| Is FCFS Scheduling Holding Back vLLM's Performance in Production? | 3 | 168 | September 11, 2025 |
| General questions on structured output backend | 9 | 738 | September 3, 2025 |
| Clarification: Does vLLM support concurrent decoding with multiple LoRA adapters in online inference? | 1 | 393 | August 29, 2025 |
| How to do KV cache transfer between a CPU instance and a GPU instance? | 1 | 217 | July 31, 2025 |
| Support for Deploying a 4-bit Fine-Tuned Model with LoRA on vLLM | 13 | 707 | July 30, 2025 |
| Does vLLM support tp>1 for the draft model when using speculative decoding? | 1 | 136 | July 29, 2025 |
| Is there any roadmap to support prefix caching on DRAM and disk? | 1 | 109 | July 25, 2025 |
| Performance Degradation and Compatibility Issues with AWQ Quantization in vLLM (Qwen2.5-VL-32B) | 1 | 483 | July 23, 2025 |
| Multi-node K8s GPU pooling | 3 | 364 | July 17, 2025 |
| Error trying to handle streaming tool call | 3 | 401 | July 17, 2025 |
| Improving Speculative Decoding for Beginning Tokens & Structured Output | 1 | 133 | July 16, 2025 |
| Question: Specifying the Medusa Choice Tree in vLLM | 1 | 89 | July 11, 2025 |
| Disagg Prefill timeout | 1 | 106 | July 7, 2025 |
| MoE quantization | 9 | 1182 | July 2, 2025 |
| Why are CUDA graph capture sizes limited by max_num_seqs? | 1 | 710 | June 29, 2025 |
| Scheduler in vLLM | 1 | 297 | June 26, 2025 |
| prompt_embeds usage in the vLLM OpenAI completion API | 4 | 159 | June 17, 2025 |
| Is there a detailed introduction to the two W8A8 quantization methods? | 1 | 185 | June 15, 2025 |
| Sequence Parallelism Support: Source Code Location | 0 | 40 | June 10, 2025 |
| Something weird about the reading procedure of q_vecs in the paged attention kernel | 3 | 24 | June 9, 2025 |
| Computation time remains consistent across chunks in chunked prefill despite linearly growing attention complexity? | 1 | 56 | June 2, 2025 |
| Why does computation time remain consistent across chunks in chunked prefill despite linearly growing attention complexity? | 3 | 93 | June 2, 2025 |
| APC slowdown with block-size=1 | 1 | 73 | May 26, 2025 |
| RunBot's math-to-text on NVIDIA NeMo Framework AutoModel | 11 | 97 | May 19, 2025 |
| Issue with Dynamic YaRN and key-value cache reuse in vLLM | 1 | 130 | May 18, 2025 |
| VUA: library code for LLM inference engines for external storage of KV caches | 1 | 78 | May 13, 2025 |