| Topic | Replies | Views | Activity |
|---|---:|---:|---|
| Is there any roadmap to support prefix caching on DRAM and disk? | 1 | 82 | July 25, 2025 |
| Performance Degradation and Compatibility Issues with AWQ Quantization in vLLM (Qwen2.5-VL-32B) | 1 | 361 | July 23, 2025 |
| Multi-node K8s GPU pooling | 3 | 228 | July 17, 2025 |
| Error trying to handle streaming tool call | 3 | 269 | July 17, 2025 |
| Improving Speculative Decoding for Beginning Tokens & Structured Output | 1 | 109 | July 16, 2025 |
| Question: Specifying Medusa Choice Tree in vLLM | 1 | 70 | July 11, 2025 |
| Disagg Prefill timeout | 1 | 64 | July 7, 2025 |
| MoE quantization | 9 | 990 | July 2, 2025 |
| Why are CUDA graph capture sizes limited by max_num_seqs? | 1 | 512 | June 29, 2025 |
| Scheduler in vLLM | 1 | 245 | June 26, 2025 |
| Prompt_embeds usage in the vLLM OpenAI completion API | 4 | 94 | June 17, 2025 |
| Is there a detailed introduction to the two W8A8 quantization methods? | 1 | 144 | June 15, 2025 |
| Sequence Parallelism Support - Source Code Location | 0 | 32 | June 10, 2025 |
| Something weird about the reading procedure of q_vecs in the paged attention kernel | 3 | 13 | June 9, 2025 |
| Computation time remains consistent across chunks in chunked prefill despite linearly growing attention complexity? | 1 | 41 | June 2, 2025 |
| Why does computation time remain consistent across chunks in chunked prefill despite linearly growing attention complexity? | 3 | 54 | June 2, 2025 |
| APC Slowdown with block-size=1 | 1 | 59 | May 26, 2025 |
| RunBot's math-to-text on NVIDIA NeMo Framework AutoModel | 11 | 81 | May 19, 2025 |
| Issue with DynamicYaRN and Key-Value Cache Reuse in vLLM | 1 | 105 | May 18, 2025 |
| VUA - library code for LLM inference engines for external storage of KV caches | 1 | 58 | May 13, 2025 |
| Specifying special tokens | 5 | 462 | May 8, 2025 |
| vLLM cannot connect to existing Ray cluster | 16 | 883 | May 8, 2025 |
| Support for (sparse) key-value caching | 16 | 384 | May 3, 2025 |
| How to use speculative decoding? | 3 | 577 | May 1, 2025 |
| Grammar CPU-bound performance | 9 | 411 | April 29, 2025 |
| Does vLLM support multiple model_executor? | 1 | 257 | April 28, 2025 |
| Spec decode with EAGLE gets a very low draft acceptance rate | 1 | 247 | April 25, 2025 |
| LoRA adapter enabling with vLLM is not working | 4 | 361 | April 21, 2025 |
| Goodput-Guided Speculative Decoding | 2 | 192 | April 19, 2025 |
| Is structured output compatible with automatic prefix caching? | 1 | 84 | April 14, 2025 |