| Topic | Replies | Views | Activity |
| --- | --- | --- | --- |
| About the Features category | 0 | 36 | March 20, 2025 |
| Understanding vLLM KV cache | 5 | 37 | December 1, 2025 |
| Has anyone successfully run DBO in a single-node multi-card environment? | 2 | 14 | December 3, 2025 |
| EPLB behavior in elastic scaling | 21 | 15 | November 28, 2025 |
| Qwen2.5 VL fails to enable flashinfer | 5 | 29 | November 24, 2025 |
| How to improve throughput in single-node multi-GPU deployment | 10 | 63 | November 24, 2025 |
| Does LMFE_STRICT_JSON_FIELD_ORDER not work? | 4 | 24 | November 19, 2025 |
| Clarification Needed on Testing Elastic EP and Its Library Installation Dependencies | 1 | 22 | November 19, 2025 |
| Custom modality | 3 | 14 | November 14, 2025 |
| Asking about 6-bit Quantization | 1 | 29 | November 11, 2025 |
| Expert offloading | 1 | 74 | November 11, 2025 |
| Raw tokens completion via online serving | 1 | 33 | November 3, 2025 |
| Deployment example for a Qwen3 model with hybrid thinking | 8 | 623 | October 29, 2025 |
| vLLM extremely slow / no response with max_model_len=8192 and multi-GPU tensor parallel | 1 | 180 | October 26, 2025 |
| When using large batches, the Ray service crashes: ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read | 41 | 875 | October 26, 2025 |
| Why does vLLM not support LMP? | 3 | 62 | October 23, 2025 |
| Is it possible to configure the order of the pipeline in multi-node deployments? | 3 | 23 | October 16, 2025 |
| Question on Advanced vLLM Use Case: Distributed Prefix Caching for a CAG Evaluation Framework | 1 | 60 | October 15, 2025 |
| A bit of frustration with Quantization | 5 | 324 | October 14, 2025 |
| DeepSeek-V3 tool_choice="auto" not working, but tool_choice="required" is working | 4 | 412 | October 13, 2025 |
| Can we reuse CUDA graph across layers? | 2 | 43 | October 9, 2025 |
| MCP tool-server OpenAI Responses API | 3 | 423 | September 25, 2025 |
| Pass instructions to Qwen Embedding / Reranker via OpenAI-compatible server? | 5 | 298 | September 11, 2025 |
| Is FCFS Scheduling Holding Back vLLM's Performance in Production? | 3 | 97 | September 11, 2025 |
| General questions on structured output backend | 9 | 395 | September 3, 2025 |
| Clarification: Does vLLM support concurrent decoding with multiple LoRA adapters in online inference? | 1 | 215 | August 29, 2025 |
| How to do KV cache transfer between a CPU instance and a GPU instance? | 1 | 153 | July 31, 2025 |
| Support for Deploying a 4-bit Fine-Tuned Model with LoRA on vLLM | 13 | 404 | July 30, 2025 |
| Does vLLM support a draft model with tp>1 when using speculative decoding? | 1 | 105 | July 29, 2025 |
| Is there any roadmap to support prefix caching on DRAM and disk? | 1 | 81 | July 25, 2025 |