| Topic | Replies | Views | Activity |
| --- | --- | --- | --- |
| Welcome to vLLM Forums! :wave: | 3 | 1109 | November 12, 2025 |
| About the General category | 0 | 66 | March 17, 2025 |
| OpenAI Embeddings Not Working | 2 | 6 | January 22, 2026 |
| GLM-4.7-Flash with NVIDIA | 9 | 82 | January 22, 2026 |
| How to understand the cross-device memory visibility comments in custom_all_reduce stage 2 | 6 | 95 | January 22, 2026 |
| vLLM Tensor Parallel Workers Not Completing Initialization | 3 | 5 | January 21, 2026 |
| Max_tokens_per_doc support for rerank models | 1 | 2 | January 21, 2026 |
| For a long-input request split into chunks, say four, can the four chunks be prefilled simultaneously, or is there a dependency between them? | 15 | 31 | January 21, 2026 |
| Persistent segfaults/SIGSEGV | 1 | 4 | January 20, 2026 |
| Why is it so slow to build vLLM from source using Docker? | 39 | 67 | January 17, 2026 |
| HarmonyError: Unexpected token 200002 while expecting start token 200006 | 1 | 32 | January 14, 2026 |
| Issue: Unable to pass precomputed image embeddings to vLLM | 12 | 46 | January 14, 2026 |
| Clarify vLLM Wheels: What Does the +cu129 Tag Actually Change in v0.11.x? | 1 | 12 | January 13, 2026 |
| Why Does Latency Remain Unchanged in vLLM 0.11.0 When Input Token Count Decreases for qwen3-vl-30b-a3b? | 1 | 11 | January 13, 2026 |
| Why doesn't the parameter n in SamplingParams work as expected? | 4 | 182 | January 13, 2026 |
| vLLM Engine Arguments Documentation | 1 | 24 | January 12, 2026 |
| vLLM running on NVIDIA NIM vs native vLLM tuning options | 1 | 35 | January 10, 2026 |
| With PP=8 parallelism, does update_from_output run only after model_executor.execute_model has finished on all ranks? | 84 | 182 | January 8, 2026 |
| Accuracy degradation when the image is placed after the text | 5 | 23 | January 7, 2026 |
| Under what circumstances is a request scheduled more than once? | 58 | 87 | January 6, 2026 |
| How to set different attention backends for the prefill and decode stages? | 2 | 40 | January 6, 2026 |
| Active vs Reserved GPU Memory | 1 | 14 | January 5, 2026 |
| Sampler.hip:564:63: error: local memory (66032) exceeds limit (65536) in 'void vllm::topKPerRowDecode<1024, true, false, true>' | 1 | 13 | January 5, 2026 |
| Which llmcompressor version is compatible with vLLM 0.11.0? | 1 | 13 | January 5, 2026 |
| How to run inference with or deploy my custom model | 1 | 46 | January 5, 2026 |
| Question about full cudagraph of FlashAttention-v2 | 13 | 32 | January 5, 2026 |
| RTX 5090 + GLM incompatibility issues - Please update | 2 | 112 | January 4, 2026 |
| How to visualize vLLM inference metrics? | 1 | 32 | January 4, 2026 |
| How to maximize the throughput of an inference service | 3 | 123 | January 4, 2026 |
| Which software components does vLLM inference need? | 3 | 50 | December 30, 2025 |