Hi, I have been using vLLM for all my projects, but I had been thinking I should give SGLang a try, so I ran a performance comparison between them. Before the test I had no idea what result I would get, since I had no bias either way, and I was very surprised by the outcome!
I used a single A10 GPU to test Qwen 2.5-7B, since I had a specific, focused goal: to evaluate how vLLM and SGLang perform when serving a small LLM on a mid-range NVIDIA GPU like the A10.
I found that SGLang used only 7 GB of GPU memory, compared with 21 GB for vLLM (the A10 has 24 GB in total), and delivered a much better result, especially in the consistency of response times.
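(For context on how the response times can be collected: both engines expose an OpenAI-compatible HTTP endpoint when launched as servers, so a small client like the sketch below is enough to record per-request latencies. This is only an illustration, not my exact script; the port, model name, prompt, and request count are placeholders.)

```python
import time
import requests  # assumes the engine is running as an OpenAI-compatible server

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder port; adjust per engine
MODEL = "Qwen/Qwen2.5-7B-Instruct"                       # placeholder model name

def time_request(prompt: str) -> float:
    """Send one chat completion request and return its wall-clock latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = requests.post(BASE_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

latencies = [time_request("Summarize the benefits of unit testing.") for _ in range(20)]
print(f"min={min(latencies):.2f}s  max={max(latencies):.2f}s  "
      f"avg={sum(latencies)/len(latencies):.2f}s")
```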
SGLang uses less GPU memory and delivers better performance here largely because of its memory management and batch scheduling. It imports only a few fused kernels and layers from vLLM, essentially using vLLM as a kernel library, while handling memory and scheduling itself, which lets it keep memory usage and throughput under tight control. As a result, SGLang can use significantly less memory for the same task than vLLM, which pre-allocates a large portion of GPU memory up front for its operations, including KV caching (source, source).
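To illustrate the pre-allocation point, here is a minimal sketch using vLLM's offline Python API: the `gpu_memory_utilization` argument controls how much of the GPU vLLM reserves up front (weights plus KV-cache blocks), so the figure you see in `nvidia-smi` reflects that reservation rather than what the current request needs. The model name and values below are placeholders; SGLang's analogous knob is, as far as I know, `--mem-fraction-static`.

```python
from vllm import LLM, SamplingParams

# vLLM reserves roughly gpu_memory_utilization * total GPU memory up front
# (model weights + pre-allocated KV-cache blocks), independent of the workload.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    gpu_memory_utilization=0.9,          # lower this to shrink the up-front reservation
    max_model_len=4096,                  # shorter context -> fewer KV-cache blocks to reserve
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```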
Would you like more detail on how SGLang achieves these optimizations?
Thanks for the reply. I made a mistake in my initial test: I used SGLang's --max-total-tokens flag, thinking it was the equivalent of vLLM's --max-model-len, but I later found that the equivalent is --context-length. With that flag set instead, the memory usage is basically the same.
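For anyone else who hits the same confusion, the mapping as I understand it: vLLM's --max-model-len corresponds to SGLang's --context-length (the per-request context limit), while SGLang's --max-total-tokens sizes the KV-cache token pool, which is why setting it shrank the memory footprint so much. Below is a rough sketch of launching both servers with matched context lengths; the model path, ports, and context value are placeholders, and only one engine fits on a single A10 at a time.

```python
import subprocess

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
CTX = 4096                           # same context cap for both engines

# vLLM: --max-model-len limits the per-request context length.
vllm_cmd = [
    "vllm", "serve", MODEL,
    "--max-model-len", str(CTX),
    "--port", "8000",
]

# SGLang: --context-length is the equivalent per-request limit;
# --max-total-tokens (what I used by mistake) sizes the KV-cache pool instead.
sglang_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--context-length", str(CTX),
    "--port", "30000",
]

# Launch whichever engine you want to benchmark.
subprocess.run(vllm_cmd)   # or: subprocess.run(sglang_cmd)
```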
SGLang still performed better in my test, but it has a “warm-up” effect that does not exist in vLLM.
The “warm-up” effect you observed in SGLang, and not in vLLM, is most likely the initial overhead of setting up the model and optimizing the execution environment. This can include capturing CUDA graphs and other one-time optimizations that happen during the first few requests. Once this initial setup is complete, SGLang runs more efficiently, which would explain the better performance after the warm-up period (source).
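If you want to measure steady-state latency without this effect, one option is to send a few untimed warm-up requests before recording any numbers. A minimal sketch, assuming the same OpenAI-compatible endpoint as above; the URL, model name, and request count are placeholders.

```python
import requests

BASE_URL = "http://localhost:30000/v1/chat/completions"  # placeholder SGLang port
MODEL = "Qwen/Qwen2.5-7B-Instruct"                        # placeholder model name

def send(prompt: str) -> None:
    """Fire one request; the response body is ignored during warm-up."""
    requests.post(
        BASE_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
        timeout=120,
    ).raise_for_status()

# Untimed warm-up requests: let CUDA graph capture and other one-time
# setup finish before any latency numbers are recorded.
for _ in range(5):
    send("warm-up request")

# ...start the timed benchmark only after this point...
```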
Would you like more detail on how these optimizations work in SGLang and vLLM?