| Topic | Replies | Views | Activity |
|---|---|---|---|
| Welcome to vLLM Forums! :wave: | 1 | 386 | March 24, 2025 |
| Questions on piecewise torch compile design | 1 | 1 | May 7, 2025 |
| Docker explosion this morning after it worked fine for a long while | 6 | 6 | May 6, 2025 |
| vLLM cannot connect to existing Ray cluster | 14 | 10 | May 6, 2025 |
| Offline multi-node inference | 7 | 6 | May 6, 2025 |
| What is the deal with the RunLLM bot? | 3 | 22 | May 6, 2025 |
| How can I find the top-down architecture of vLLM? | 1 | 19 | May 5, 2025 |
| 32GB vs 48GB VRAM | 1 | 7 | May 3, 2025 |
| Support for (sparse) key-value caching | 16 | 20 | May 3, 2025 |
| What is the optimal combination of parallelism when deploying DeepSeek-R1 across multiple nodes? | 0 | 9 | May 2, 2025 |
| How to use speculative decoding? | 3 | 17 | May 1, 2025 |
| Run on B200/5090 without building from source? | 1 | 8 | May 1, 2025 |
| Running Gemma 3 on a multi-chip TPU fails | 5 | 98 | May 1, 2025 |
| Go Deeper / Go Dumber by vLLM AI bot | 5 | 54 | May 1, 2025 |
| Flash Attention 3 FP8 Support | 1 | 20 | April 30, 2025 |
| Grammar CPU-bound performance | 9 | 101 | April 29, 2025 |
| Setting up VllmConfig for a custom gpt2 | 2 | 18 | April 29, 2025 |
| I published a performance test result of vLLM vs SGLang; can someone help me interpret it? | 3 | 45 | April 29, 2025 |
| Which arguments affect GPU memory | 1 | 16 | April 29, 2025 |
| Is Batch Inference for Multimodal Models Truly Batch Inference? | 3 | 20 | April 29, 2025 |
| How to get the `http_*` metrics that this doc suggests are available? | 1 | 16 | April 28, 2025 |
| Extract class log probabilities from an LLM classifier using async vLLM and AWS LMI containers | 1 | 8 | April 28, 2025 |
| Does vLLM support multiple model_executor instances? | 1 | 8 | April 28, 2025 |
| Question about V1 code (v0.8.2) | 3 | 15 | April 28, 2025 |
| Failed to run distributed inference with vLLM 0.8.2 | 6 | 47 | April 27, 2025 |
| Integrating async-llm into OpenRLHF causes a Ray error when tensor_parallel_size > 1 | 1 | 17 | April 27, 2025 |
| Spec decode with EAGLE gets a very low draft acceptance rate | 1 | 36 | April 25, 2025 |
| CUDA failure: 'out of memory' | 1 | 39 | April 24, 2025 |
| Why vLLM uses a lot of CPU memory | 1 | 87 | April 21, 2025 |
| Enabling a LoRA adapter with vLLM is not working | 4 | 55 | April 21, 2025 |