How to benchmark concurrency

RunLLM · September 15, 2025, 6:53am

To measure how many concurrent users your 2-GPU vLLM Docker setup can handle for a given prompt, use the benchmark scripts included in the vLLM Docker image. Start the vLLM server with your model and the desired parallelism (e.g., --tensor-parallel-size 2), then run the benchmark_serving.py script from inside the container to simulate concurrent requests and measure throughput and latency. The script's --max-concurrency flag, which was added for this purpose, caps the number of requests in flight at once. For example:

python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
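To find a limit rather than test a single value, you can repeat that command at several concurrency levels. Here is a minimal sketch of such a sweep, using only the flags shown above. The script path, the URL, the list of levels, and the helper names build_command and sweep are assumptions for illustration, not part of vLLM:

```python
import shlex
import subprocess
import sys

# Assumed locations: adjust to your container layout and server address.
SCRIPT = "benchmarks/benchmark_serving.py"
BASE_URL = "http://localhost:8000"


def build_command(max_concurrency, num_prompts=100, extra_args=()):
    """Return the argv list for one benchmark run at a given concurrency cap."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    return [
        sys.executable, SCRIPT,
        "--backend", "openai",
        "--base-url", BASE_URL,
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
        *extra_args,  # any further flags your vLLM version needs
    ]


def sweep(levels=(1, 2, 4, 8, 16, 32, 64), extra_args=(), dry_run=True):
    """Print one benchmark command per level, and run it unless dry_run is set."""
    commands = [build_command(c, extra_args=extra_args) for c in levels]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run only prints the commands; set dry_run=False inside the
    # container once the server is up.
    sweep(dry_run=True)
```

Depending on your vLLM version, the script may need additional arguments (for example --model); those can be passed through extra_args. You will probably also want num_prompts to be comfortably larger than the highest concurrency level so each run reaches a steady state.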

This reports metrics such as request throughput, token throughput, and latency. Repeating the run at increasing --max-concurrency values shows where throughput stops improving or latency becomes unacceptable, which is the practical concurrency limit of your setup for your prompt size. For more details, see the official benchmarking documentation and max concurrency support. The benchmarks are included in the Docker image, so you do not need to clone the repo separately unless you are using a custom image.
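One way to turn those metrics into a single "maximum concurrency" number is to pick a latency target and take the highest concurrency level that still meets it. The sketch below assumes you have copied one latency figure per run (for example a P99 value) into a list by hand. The RunResult type, the function name, and the numbers in the example are made up for illustration and are not real measurements or vLLM output:

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    concurrency: int           # the --max-concurrency value used for the run
    request_throughput: float  # requests per second reported by the run
    latency_ms: float          # whichever latency figure you care about


def max_concurrency_within_target(results, latency_target_ms):
    """Return the highest level reached before latency first exceeds the target.

    Levels are checked in ascending order and the walk stops at the first
    violation, so a noisy pass at a higher level is not counted.
    Returns None if even the lowest level misses the target.
    """
    best = None
    for r in sorted(results, key=lambda r: r.concurrency):
        if r.latency_ms > latency_target_ms:
            break
        best = r.concurrency
    return best


if __name__ == "__main__":
    # Placeholder numbers only, to show the shape of the data.
    example = [
        RunResult(8, 5.0, 400.0),
        RunResult(16, 9.0, 700.0),
        RunResult(32, 11.0, 1500.0),
        RunResult(64, 11.5, 4000.0),
    ]
    print(max_concurrency_within_target(example, latency_target_ms=1000.0))  # prints 16
```

Looking at request_throughput alongside this helps too: if throughput has already flattened at a lower level, adding concurrency mostly adds queueing delay.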

Would you like a step-by-step example for your specific model and prompt?
