Topic | Replies | Views | Activity
About the Features category | 0 | 67 | March 20, 2025
Can GPTQModel quantize GLM-5 from FP16 to INT8? | 9 | 52 | April 24, 2026
DeepSeek MTP full cuda graph support? | 3 | 41 | April 13, 2026
Qwen3.5-27B-FP8 Speculative Decoding | 2 | 1555 | April 11, 2026
thinking_token_budget silently ignored when passed via extra_args in vLLM 0.18.0 | 1 | 91 | April 11, 2026
GLM 5 / Kimi k2.5 on 4 x RTX 6000 Pro | 1 | 151 | March 22, 2026
Compressed Multimodal embeddings inputs | 1 | 26 | March 18, 2026
NVFP4 Support In Attention | 1 | 337 | March 16, 2026
Distributed Speculative Decoding using Ray | 3 | 96 | February 11, 2026
Deployment example for a qwen3 model with hybrid thinking | 10 | 1574 | February 4, 2026
How do I precompute multimodal embeddings? | 5 | 161 | February 2, 2026
Implementing hidden state probes | 1 | 41 | January 30, 2026
Standalone draft model spec decode support in v0.x and v1 | 3 | 124 | January 20, 2026
How to get KV cache values from vLLM | 5 | 249 | January 19, 2026
Is there a plan for EVS to support Qwen3VL in response to the issue of sparse video tokens? | 1 | 75 | January 13, 2026
Exposing KV cache for recomposition / reuse beyond prefix caching? | 1 | 78 | January 13, 2026
Why does the marlin CUDA kernel not run fast? | 5 | 132 | January 9, 2026
Can the reasoning_effort parameter not be used in the vLLM implementation via Python? | 1 | 262 | January 2, 2026
Understanding vLLM KV cache | 5 | 1042 | December 1, 2025
Has anyone successfully run DBO in a single-node multi-card environment? | 1 | 77 | December 1, 2025
EPLB behavior in elastic scaling | 21 | 144 | November 28, 2025
Qwen2.5 VL fails to enable flashinfer | 5 | 262 | November 24, 2025
How to improve throughput in single-node multi-GPU deployment | 10 | 525 | November 24, 2025
Does LMFE_STRICT_JSON_FIELD_ORDER not work? | 4 | 70 | November 19, 2025
Clarification Needed on Testing Elastic EP and Its Library Installation Dependencies | 1 | 36 | November 19, 2025
Custom modality | 3 | 39 | November 14, 2025
Asking about 6-bit Quantization | 1 | 149 | November 11, 2025
Expert offloading | 1 | 500 | November 11, 2025
Raw tokens completion via online serving | 1 | 94 | November 3, 2025
vLLM extremely slow / no response with max_model_len=8192 and multi-GPU tensor parallel | 1 | 878 | October 26, 2025