Hi, I have one question regarding benchmarking
Is running in docker/containers recommended? I tried to benchmark a vLLM server's performance using the vllm bench serve command and played around with the following setups (rough command sketches for the docker case are below the list):
setup 1: one client (docker, vllm bench serve) sends requests to a vLLM server (docker)
setup 2: one client (docker, vllm bench serve) sends requests to two vLLM servers (docker) load-balanced by HAProxy
setup 3: two clients (docker) send requests separately to two different vLLM servers (docker) at the same time (assigned to GPU 0 and GPU 1, respectively)
setup 4: same as setup 1, except the client and server run as host-machine processes.
setup 5: same as setup 2, except the client and servers run as host-machine processes.
setup 6: same as setup 3, except the clients and servers run as host-machine processes.
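For concreteness, the docker server side in setups 1-3 looks roughly like this. The model name, image tag, and exact flag spellings are placeholders from memory, not necessarily my exact values, so please read it as a sketch:

```bash
# vLLM OpenAI-compatible server in docker; in setup 3 each server is pinned to
# its own GPU via --gpus. Prefix caching disabled, non-eager mode (no --enforce-eager).
# Flag spelling may differ slightly depending on the vLLM version.
docker run --rm --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-prefix-caching
```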
I would expect setups (1 and 4), (2 and 5), and (3 and 6) to have approximately the same throughput, respectively.
However, I observed:
- Setups 1 and 4 have approximately the same token throughput, but setup 1 has roughly 30% higher latency
- Setup 2 reaches only about 30% of the token throughput of setup 5
- Each client in setup 3 sees only 50% of the throughput it gets in setup 1, whereas each client in setup 6 sees no degradation compared to setup 4
FYI, I used the sharegpt dataset with 256 as the output-length parameter in vllm bench serve, and each load test runs for 1000 requests. The vLLM servers disable prefix caching and run in non-eager mode.
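Concretely, the client invocation looks roughly like the following. The host/port, model name, and dataset path are placeholders, and the flag names are from the vllm bench serve help as I remember it, so double-check against your version:

```bash
# Benchmark client: ShareGPT prompts, fixed 256 output tokens, 1000 requests total
vllm bench serve \
  --host 127.0.0.1 --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-output-len 256 \
  --num-prompts 1000
```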
Given nvidia-smi's GPU utilization, I suspect that requests cannot reach the vLLM servers fast enough when everything runs in docker: setups 4-6 all show 100% GPU utilization during the load test, whereas GPU utilization fluctuates around 75% in the docker setups (1-3).
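For reference, the utilization numbers above come from sampling nvidia-smi during the run, along the lines of the query below (the one-second interval is just illustrative):

```bash
# Print per-GPU utilization once per second while the load test runs
nvidia-smi --query-gpu=index,utilization.gpu --format=csv -l 1
```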
It would be of great help if someone could shed some light on this issue, which has baffled me for the past two days.
Since AIBrix also runs in a container setup, I wonder whether there are any prerequisites for the underlying machine/kernel, such as specific kernel parameters that should be tuned.
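To make the question concrete, these are the kinds of host-side knobs I had in mind; they are purely examples of what I would compare between the docker and bare-metal runs, not things I have evidence for yet:

```bash
# Candidate networking / fd-limit settings to compare between docker hosts and bare-metal runs
sysctl net.core.somaxconn
sysctl net.core.netdev_max_backlog
sysctl net.ipv4.ip_local_port_range
ulimit -n   # open-file limit seen by the benchmark client
```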