Hi, I have been using vLLM for all my projects, but I had been thinking I should give SGLang a try, so I ran a performance comparison between them. Before the test I had no idea what result I would get, since I had no bias either way, and I was very surprised by the outcome!
I used a single A10 GPU to test Qwen 2.5-7B, since I had a specific, focused goal: to evaluate how vLLM and SGLang perform when serving a small LLM on a mid-range NVIDIA GPU like the A10.
I found that SGLang used only 7 GB of GPU memory, compared with 21 GB for vLLM (the A10 has 24 GB in total), and delivered a much better result, especially in the consistency of response times.
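(For context on how the response times can be collected: both engines expose an OpenAI-compatible HTTP endpoint when launched as servers, so a small client like the sketch below is enough to record per-request latencies. This is only an illustration, not my exact script; the port, model name, prompt, and request count are placeholders.)

```python
import time
import requests  # assumes the engine is running as an OpenAI-compatible server

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder port; adjust per engine
MODEL = "Qwen/Qwen2.5-7B-Instruct"                       # placeholder model name

def time_request(prompt: str) -> float:
    """Send one chat completion request and return its wall-clock latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = requests.post(BASE_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

latencies = [time_request("Summarize the benefits of unit testing.") for _ in range(20)]
print(f"min={min(latencies):.2f}s  max={max(latencies):.2f}s  "
      f"avg={sum(latencies)/len(latencies):.2f}s")
```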
SGLang uses less GPU memory and delivers better performance here largely because of its memory management and batch scheduling. It imports only a few fused kernels and layers from vLLM, essentially using vLLM as a kernel library, while handling memory and scheduling itself, which lets it keep memory usage and throughput under tight control. As a result, SGLang can use significantly less memory for the same task than vLLM, which pre-allocates a large portion of GPU memory up front for its operations, including KV caching (source, source).
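To illustrate the pre-allocation point, here is a minimal sketch using vLLM's offline Python API: the `gpu_memory_utilization` argument controls how much of the GPU vLLM reserves up front (weights plus KV-cache blocks), so the figure you see in `nvidia-smi` reflects that reservation rather than what the current request needs. The model name and values below are placeholders; SGLang's analogous knob is, as far as I know, `--mem-fraction-static`.

```python
from vllm import LLM, SamplingParams

# vLLM reserves roughly gpu_memory_utilization * total GPU memory up front
# (model weights + pre-allocated KV-cache blocks), independent of the workload.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    gpu_memory_utilization=0.9,          # lower this to shrink the up-front reservation
    max_model_len=4096,                  # shorter context -> fewer KV-cache blocks to reserve
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```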
Would you like more detail on how SGLang achieves these optimizations?
Thanks for the reply. I made a mistake in my initial test: I used SGLang's --max-total-tokens flag, thinking it was the equivalent of vLLM's --max-model-len, but I later found that the equivalent is --context-length. With that flag set instead, the memory usage is basically the same.
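For anyone else who hits the same confusion, the mapping as I understand it: vLLM's --max-model-len corresponds to SGLang's --context-length (the per-request context limit), while SGLang's --max-total-tokens sizes the KV-cache token pool, which is why setting it shrank the memory footprint so much. Below is a rough sketch of launching both servers with matched context lengths; the model path, ports, and context value are placeholders, and only one engine fits on a single A10 at a time.

```python
import subprocess

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
CTX = 4096                           # same context cap for both engines

# vLLM: --max-model-len limits the per-request context length.
vllm_cmd = [
    "vllm", "serve", MODEL,
    "--max-model-len", str(CTX),
    "--port", "8000",
]

# SGLang: --context-length is the equivalent per-request limit;
# --max-total-tokens (what I used by mistake) sizes the KV-cache pool instead.
sglang_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--context-length", str(CTX),
    "--port", "30000",
]

# Launch whichever engine you want to benchmark.
subprocess.run(vllm_cmd)   # or: subprocess.run(sglang_cmd)
```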
SGLang still performed better in my test, but it has a “warm-up” effect that does not exist in vLLM.
The “warm-up” effect you observed in SGLang, and not in vLLM, is most likely the initial overhead of setting up the model and optimizing the execution environment. This can include capturing CUDA graphs and other one-time optimizations that happen during the first few requests. Once this initial setup is complete, SGLang runs more efficiently, which would explain the better performance after the warm-up period (source).
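If you want to measure steady-state latency without this effect, one option is to send a few untimed warm-up requests before recording any numbers. A minimal sketch, assuming the same OpenAI-compatible endpoint as above; the URL, model name, and request count are placeholders.

```python
import requests

BASE_URL = "http://localhost:30000/v1/chat/completions"  # placeholder SGLang port
MODEL = "Qwen/Qwen2.5-7B-Instruct"                        # placeholder model name

def send(prompt: str) -> None:
    """Fire one request; the response body is ignored during warm-up."""
    requests.post(
        BASE_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
        timeout=120,
    ).raise_for_status()

# Untimed warm-up requests: let CUDA graph capture and other one-time
# setup finish before any latency numbers are recorded.
for _ in range(5):
    send("warm-up request")

# ...start the timed benchmark only after this point...
```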
Would you like more detail on how these optimizations work in SGLang and vLLM?