To measure how many concurrent users your 2-GPU vLLM Docker setup can handle for a given prompt, use the benchmark scripts that ship with the vLLM Docker image. Start the vLLM server with your model and desired parallelism (e.g., --tensor-parallel-size 2), then run benchmark_serving.py from inside the container to send concurrent requests and measure throughput and latency. The --max-concurrency flag caps how many requests are in flight at once. The script also needs the served model name (--model) and a dataset; with --dataset-name random you can set --random-input-len and --random-output-len to approximate your prompt and response sizes. For example, with placeholder values:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model <your-model> --dataset-name random --random-input-len 512 --random-output-len 128 --num-prompts 100 --max-concurrency 32
This reports request throughput, token throughput, and latency (such as time to first token) at that concurrency level. A single run only shows how the server behaves at one cap, so repeat it with increasing --max-concurrency values and treat the highest level at which latency stays within your target as your practical limit. For more details, see the official benchmarking documentation and max concurrency support. The benchmarks are included in the Docker image, so you do not need to clone the repo separately unless you are using a custom image.
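If you want to automate runs at several concurrency levels, a small Python wrapper is one way to do it. The sketch below is an illustration only, not part of vLLM: the names build_command, run_sweep, and highest_passing_level are hypothetical, the model name and script path are placeholders, and the latency figures you pass to highest_passing_level are ones you would read off each benchmark report yourself.

```python
import shlex
import subprocess
import sys

# Placeholders (assumptions): substitute the values for your own deployment.
MODEL = "your-model-name"            # should match the model the server is serving
BASE_URL = "http://localhost:8000"
SCRIPT = "benchmarks/benchmark_serving.py"


def build_command(max_concurrency, num_prompts=100, extra_args=()):
    """Return the argv list for one benchmark run at a given concurrency cap."""
    return [
        sys.executable, SCRIPT,
        "--backend", "openai",
        "--base-url", BASE_URL,
        "--model", MODEL,
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
        *extra_args,                 # e.g. dataset flags for your prompt size
    ]


def run_sweep(levels, num_prompts=100, extra_args=()):
    """Run the benchmark once per concurrency level, stopping on the first failure."""
    for level in levels:
        cmd = build_command(level, num_prompts, extra_args)
        print("Running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)


def highest_passing_level(results, latency_budget_ms):
    """Pick the largest concurrency level whose latency is within budget.

    `results` maps concurrency level -> a latency figure in milliseconds taken
    from the benchmark report. Returns None when no level meets the budget.
    """
    passing = [level for level, latency in results.items()
               if latency <= latency_budget_ms]
    return max(passing) if passing else None
```

From inside the container you could call run_sweep([1, 2, 4, 8, 16, 32, 64]), note the latency reported at each level, and then pass those numbers to highest_passing_level along with your own latency target. Keeping the command builder separate from the runner makes it easy to print and check the commands before launching anything.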
Would you like a step-by-step example for your specific model and prompt?