I published a performance test of vLLM vs SGLang, but can someone help me explain the result?

Hi, I use vLLM for all my projects, but I had been thinking I should give SGLang a try, so I ran a performance test comparing the two. Going in I had no idea what result I would get and no bias either way, so I was very surprised by the outcome!

I used a single A10 GPU to test Qwen 2.5-7B, since I have a specific, focused goal: to evaluate how vLLM and SGLang perform when running a small LLM on a mid-range NVIDIA GPU like the A10.
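The measurement itself is just a timed loop of requests against each server's OpenAI-compatible endpoint. Here is a minimal sketch of that kind of loop (not the exact script from my repo); it assumes a server is already running on localhost:8000 and serving the Qwen 2.5-7B Instruct weights, and the prompt and sampling parameters are placeholders:

```python
# Minimal sketch of a per-request latency loop (not the exact script from the repo).
# Assumes a vLLM or SGLang server is already running with its OpenAI-compatible API
# on localhost:8000; adjust the URL and model id to your setup.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model id

latencies = []
for i in range(20):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a haiku about GPUs, variant {i}."}],
        "max_tokens": 128,
        "temperature": 0.7,
    }
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean {sum(latencies)/len(latencies):.2f}s  "
      f"min {min(latencies):.2f}s  max {max(latencies):.2f}s")
```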

I found that SGLang uses only 7 GB of GPU memory compared with vLLM's 21 GB (the A10 has 24 GB in total) and delivers much better results, especially far more consistent response times.

But why is there such a big difference? Can someone help explain it? This is my project: GitHub - qiulang/vllm-sglang-perf: Evaluate how vLLM and SGLang perform when running a small LLM model on a mid-range NVIDIA GPU

Thanks a lot.

SGLang uses less GPU memory and delivers better performance because of its efficient memory management and batch scheduling. It imports only a few fused kernels and layers from vLLM, using vLLM as a kernel library, which lets it optimize memory usage and throughput effectively. As a result, SGLang can show significantly lower memory usage for the same task than vLLM, which pre-allocates a large portion of GPU memory up front for its operations, including the KV cache.
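Both engines expose a knob for how much GPU memory they reserve up front. The sketch below shows those knobs through each engine's offline Python API; it is an illustration only, the argument names are the ones documented for recent versions and may differ in yours, and you would run the two engines one at a time, since a 7B model plus a full KV-cache pool will not fit twice on one A10.

```python
# Illustration only: the memory pre-allocation knobs in each engine's offline Python API.
# Argument names and defaults may differ across versions; run one engine at a time on one A10.

# vLLM: gpu_memory_utilization is the fraction of GPU memory reserved up front,
# most of which becomes the paged KV cache (about 0.9 by default, i.e. roughly 21 GB on an A10).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.9)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))

# SGLang: mem_fraction_static plays the analogous role for its static memory pool
# (the same knob is --mem-fraction-static when launching the server).
import sglang as sgl

engine = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct", mem_fraction_static=0.85)
print(engine.generate("Hello", {"max_new_tokens": 16}))
engine.shutdown()
```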

Would you like more detail on how SGLang achieves these optimizations?


Thanks for the reply. I made a mistake in my initial test: I used SGLang's --max-total-tokens flag, thinking it was the equivalent of vLLM's --max-model-len, but I later found that the equivalent is actually --context-length. With that flag the memory usage is basically the same for both.
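For anyone else who hits this, here is a sketch of the two launch commands with the flags that actually correspond (model id and values are placeholders, not the exact ones from my repo, and flag behavior may vary across versions):

```python
# Sketch of the two launch commands with the flags that correspond to each other.
# Values are placeholders; run one server at a time on a single A10.
import subprocess

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model id

vllm_cmd = [
    "vllm", "serve", MODEL,
    "--max-model-len", "4096",        # vLLM: maximum sequence length per request
]

sglang_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--context-length", "4096",       # SGLang's equivalent of --max-model-len
    # "--max-total-tokens", "...",    # different knob: caps the total KV-cache token pool
]

# subprocess.Popen(vllm_cmd)   # or sglang_cmd; uncomment whichever server you are testing
```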

SGLang still performed better in my test, but it has a “warm-up” effect that does not exist in vLLM.

I have further updated my results in GitHub - qiulang/vllm-sglang-perf: Evaluate how vLLM and SGLang perform when running a small LLM model on a mid-range NVIDIA GPU

The “warm-up” effect you observed in SGLang, which is not present in vLLM, is likely due to initial overhead in setting up the model and optimizing the execution environment. This can include capturing CUDA graphs or other optimizations that take place during the first few requests. Once that initial setup is complete, SGLang can run more efficiently, which would explain the better performance after the warm-up period.
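If that is the cause, a practical way to keep it out of a benchmark is to send a few untimed requests before measuring, so CUDA graph capture and other first-request work finishes first. A minimal sketch, assuming an OpenAI-compatible server on localhost:8000 (URL, model id, and prompt are placeholders):

```python
# Sketch: exclude warm-up from the measurement by sending a few untimed requests first.
# Assumes an OpenAI-compatible server on localhost:8000; adjust URL and model id as needed.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Warm-up request."}],
    "max_tokens": 32,
}

# Untimed warm-up: lets CUDA graph capture and other lazy initialization finish first.
for _ in range(3):
    requests.post(URL, json=payload, timeout=300).raise_for_status()

# Requests timed afterwards reflect steady-state latency.
start = time.perf_counter()
requests.post(URL, json=payload, timeout=300).raise_for_status()
print(f"steady-state latency: {time.perf_counter() - start:.2f}s")
```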

Would you like more detail on how these optimizations work in SGLang and vLLM?
