Benchmarking vllm performance - Having trouble sending requests to vllm servers

Hi, I have one question regarding benchmarking.
Is running in Docker/containers recommended? I tried to benchmark a vLLM server's performance using the vllm bench serve command and played around with the following setups (rough command sketches follow the list):

setup 1: one client (docker vllm bench serve) sends requests to a vllm server (docker)
setup 2: one client (docker vllm bench serve) sends requests to two vllm servers (docker) load-balanced by an HAProxy instance
setup 3: two clients (docker) send requests separately to two different vllm servers (docker) at the same time (assigned to GPU 0 and GPU 1, respectively)
setup 4: same as setup 1, except the client and server are host machine processes.
setup 5: same as setup 2, except the client and servers are host machine processes.
setup 6: same as setup 3, except the clients and servers are host machine processes.
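
For concreteness, the containerized servers were launched roughly like this; the image tag, model name, and ports are placeholders rather than my exact values, and setups 4-6 run the equivalent vllm serve / vllm bench serve commands directly on the host:

```bash
# One vLLM server per GPU, each in its own container (setups 1-3).
# Setup 2 puts HAProxy in front of host ports 8000/8001 with round-robin
# balancing; setup 3 points one bench client directly at each port instead.
docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --no-enable-prefix-caching

docker run --gpus '"device=1"' -p 8001:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --no-enable-prefix-caching
```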

I would expect

Setup pairs (1 and 4), (2 and 5), and (3 and 6) would each have approximately the same throughput.

However, I observed

  1. Setups 1 and 4 have approximately the same token throughput, but setup 1 has roughly 30% higher latency

  2. Setup 2 reaches only roughly 30% of the token throughput of setup 5

  3. Each client in setup 3 sees only 50% of the throughput it gets in setup 1, whereas each client in setup 6 sees no degradation compared to setup 4

FYI, I used the ShareGPT dataset with 256 as the output-length parameter in vllm bench serve, and each load test issues 1000 requests. The vLLM servers have prefix caching disabled and run in non-eager mode (CUDA graphs enabled).
Judging by nvidia-smi's GPU utilization, I suspect that requests cannot reach the vLLM servers fast enough when everything runs in Docker: setups 4-6 all hold 100% GPU utilization during the load test, whereas GPU utilization fluctuates around 75% in the Docker setups (setups 1-3).
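
The benchmark invocation was along these lines (flag names as in recent vLLM releases, double-check with vllm bench serve --help; the dataset path is a placeholder):

```bash
# 1000 ShareGPT prompts, output length fixed at 256 tokens per request.
# Prefix caching is disabled on the server side via --no-enable-prefix-caching
# (see the server commands above); --enforce-eager is NOT passed, so CUDA
# graphs (non-eager mode) stay enabled.
vllm bench serve \
  --backend vllm \
  --host 127.0.0.1 --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-output-len 256 \
  --num-prompts 1000
```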

It would be of great help if someone could shed some light on this issue, which has baffled me for the past two days.

Since AIbrix also runs in a containerized setup, I wonder if there are any prerequisites on the underlying machine/kernel, such as specific kernel parameters that should be tuned.

It seems likely that running vLLM in Docker containers is generally supported and recommended for reproducibility and environment consistency, but some users have observed higher latency and lower throughput in containerized setups compared to running directly on the host, especially under heavy load. This may be due to Docker’s default networking, shared memory, or resource isolation settings, which can throttle inter-process communication and limit GPU utilization. The vLLM docs recommend using --ipc=host or increasing --shm-size to allow the container to access host shared memory, which is important for tensor parallel inference and can help mitigate some performance issues. There is no explicit mention in the docs of kernel parameters that must be tuned, but ensuring that Docker is configured for high-performance workloads (e.g., using host networking, sufficient shared memory, and privileged mode if needed) is advised for benchmarking and production use cases.
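
For concreteness, a server launch with the flags mentioned above might look like the sketch below; the image tag and model name are placeholders, and --network=host is optional but removes the Docker bridge/NAT hop from the request path:

```bash
# --ipc=host lets the container share the host's IPC namespace / shared memory
# (a large --shm-size is the alternative the docs mention); --network=host
# avoids the bridge network so the bench client talks to the server directly.
docker run --runtime nvidia --gpus all \
  --ipc=host \
  --network=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```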

For multi-GPU or multi-server setups, the vLLM documentation and user reports suggest that careful attention to Docker networking, shared memory, and GPU assignment is necessary to avoid bottlenecks. If you observe lower GPU utilization or throughput in Docker, try running with --ipc=host, increasing --shm-size, and using host networking (--network=host). Also, ensure that the container is not resource-constrained by cgroups or Docker's default limits. For benchmarking, running both client and server on the host (without Docker) may yield the most accurate performance numbers, but with proper Docker configuration the gap can be minimized. As noted above, the official vLLM docs do not call out required kernel parameter tuning, though tuning may still help in very high-throughput or low-latency scenarios, especially in containerized environments. Would you like more detail on specific Docker or kernel settings to try for your setup?
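
If you want to rule out resource caps and connection-level bottlenecks, a few generic checks are sketched below; none of these are vLLM-specific requirements, just common suspects in containerized load tests, and the container name vllm-server is a placeholder:

```bash
# Is the container CPU- or memory-throttled? Watch live usage against its limits.
docker stats vllm-server

# Inspect the cgroup CPU/memory limits Docker applied to the container
# (0 means "no explicit limit").
docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' vllm-server

# File-descriptor ceiling for the bench client; every in-flight request holds a socket.
ulimit -n

# Generic network-stack settings that are sometimes raised for high-connection-rate
# load tests (listen backlog, ephemeral port range). Inspect first; only tune if you
# actually see connection queue overflows or port exhaustion.
sysctl net.core.somaxconn net.ipv4.ip_local_port_range
```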
