Hi,
How should I benchmark a vLLM Docker setup with 2 GPUs on Ubuntu 24.04? I want to know how many concurrent users it can handle for a given prompt.
To benchmark your vLLM Docker setup with 2 GPUs and measure how many concurrent users it can handle for a given prompt, use the built-in benchmark scripts included in the vLLM Docker image. Start the vLLM server with your model and desired parallelism (e.g., --tensor-parallel-size 2), then run the benchmark_serving.py script from inside the container to simulate concurrent requests and measure throughput, latency, and concurrency. You can control the number of concurrent requests using the --max-concurrency flag in the benchmark script, which was added for this purpose. For example:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
This will report metrics such as request throughput, token throughput, and latency, helping you determine the maximum concurrency your setup can handle for your prompt size. For more details, see the official benchmarking documentation and max concurrency support. Benchmarks are included in the Docker image, so you do not need to clone the repo separately unless using a custom image.
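For reference, here is a minimal sketch of launching the server across both GPUs before benchmarking; the image tag, model name, served name, and volume mount are taken from later in this thread, so treat them as assumptions and adjust to your setup:
# sketch only: model, mount path, and served name follow the setup shown later in this thread
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 \
  --tensor-parallel-size 2 --max-model-len 2048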
Would you like a step-by-step example for your specific model and prompt?
ubuntu@ubuntuAIhome01:~$ docker exec -it vllm_download_and_serve10 bash
root@ubuntuAIhome01:/vllm-workspace# python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
bash: python: command not found
root@ubuntuAIhome01:/vllm-workspace# python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
Traceback (most recent call last):
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 59, in <module>
    from benchmark_dataset import (
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 28, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
The error means the pandas Python package is not installed in your Docker container. To fix this, run:
pip install pandas
Then rerun your benchmark command. This is a common requirement, as the benchmark scripts depend on pandas for data handling, but it is not always pre-installed in the vLLM Docker image according to GitHub Issue #19078.
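Since python itself was not found in this container (only python3), pip may also be missing under that name; if so, the module form should work (an assumption based on python3 being present in the image):
# assumes python3 is on PATH, as shown above
python3 -m pip install pandas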
Would you like more troubleshooting steps or help with other missing dependencies?
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 59, in <module>
    from benchmark_dataset import (
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 29, in <module>
    from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'
You are missing the datasets package, which is required by the benchmark scripts. Install it with:
pip install datasets
This is expected, as benchmark dependencies like datasets are not always included by default in the vLLM Docker image, per GitHub Issue #19078.
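To avoid hitting these one at a time, the two packages that have failed so far can be installed in a single step (other benchmark dependencies may still turn up; this only covers the errors seen in this thread):
# installs only the two packages reported missing above
python3 -m pip install pandas datasets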
Would you like a list of all recommended benchmark dependencies?
INFO 09-15 00:05:56 [__init__.py:241] Automatically detected platform cuda.
usage: benchmark_serving.py [-h] [--backend {tgi,vllm,lmdeploy,deepspeed-mii,openai,openai-chat,openai-audio,tensorrt-llm,scalellm,sglang,llama.cpp}] [--base-url BASE_URL] [--host HOST] [--port PORT]
                            [--endpoint ENDPOINT] [--dataset-name {sharegpt,burstgpt,sonnet,random,hf,custom}] [--dataset-path DATASET_PATH] [--no-stream] [--max-concurrency MAX_CONCURRENCY] --model MODEL
                            [--tokenizer TOKENIZER] [--use-beam-search] [--num-prompts NUM_PROMPTS] [--logprobs LOGPROBS] [--request-rate REQUEST_RATE] [--burstiness BURSTINESS] [--seed SEED]
                            [--trust-remote-code] [--disable-tqdm] [--profile] [--save-result] [--save-detailed] [--append-result] [--metadata [KEY=VALUE ...]] [--result-dir RESULT_DIR]
                            [--result-filename RESULT_FILENAME] [--ignore-eos] [--percentile-metrics PERCENTILE_METRICS] [--metric-percentiles METRIC_PERCENTILES] [--goodput GOODPUT [GOODPUT ...]]
                            [--custom-output-len CUSTOM_OUTPUT_LEN] [--custom-skip-chat-template] [--sonnet-input-len SONNET_INPUT_LEN] [--sonnet-output-len SONNET_OUTPUT_LEN]
                            [--sonnet-prefix-len SONNET_PREFIX_LEN] [--sharegpt-output-len SHAREGPT_OUTPUT_LEN] [--random-input-len RANDOM_INPUT_LEN] [--random-output-len RANDOM_OUTPUT_LEN]
                            [--random-range-ratio RANDOM_RANGE_RATIO] [--random-prefix-len RANDOM_PREFIX_LEN] [--hf-subset HF_SUBSET] [--hf-split HF_SPLIT] [--hf-output-len HF_OUTPUT_LEN] [--top-p TOP_P]
                            [--top-k TOP_K] [--min-p MIN_P] [--temperature TEMPERATURE] [--tokenizer-mode {auto,slow,mistral,custom}] [--served-model-name SERVED_MODEL_NAME]
                            [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--ramp-up-strategy {linear,exponential}] [--ramp-up-start-rps RAMP_UP_START_RPS] [--ramp-up-end-rps RAMP_UP_END_RPS]
benchmark_serving.py: error: the following arguments are required: --model
The error means you must specify the --model argument when running benchmark_serving.py. For example:
python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32 --model <your_model_name>
Replace <your_model_name> with the name or path of your model. This is a required argument, as documented in the benchmarking API.
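To check which model name the running server actually expects, you can query its OpenAI-compatible model list (a standard vLLM endpoint; the port assumes the default 8000 used in this thread):
curl http://localhost:8000/v1/models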
Would you like help choosing the correct model name or path?
The model is already loaded into GPU VRAM and it's running in this same Docker container.
Anyway, this just does not work:
DeprecationWarning: benchmark_serving.py is deprecated and will be removed in a future version. Please use 'vllm bench serve' instead.
  main(args)
Namespace(backend='openai', base_url='http://localhost:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='sharegpt', dataset_path=None, no_stream=False, max_concurrency=32, model='google/gemma-3-12b-it', tokenizer=None, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None)
Traceback (most recent call last):
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 1305, in <module>
    main(args)
  File "/usr/local/lib/python3.12/dist-packages/typing_extensions.py", line 2956, in wrapper
    return arg(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 780, in main
    input_requests = dataset_mapping[args.dataset_name]()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 759, in <lambda>
    "sharegpt": lambda: ShareGPTDataset(
                        ^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 384, in __init__
    self.load_data()
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 388, in load_data
    raise ValueError("dataset_path must be provided for loading data.")
ValueError: dataset_path must be provided for loading data.
The error is because you set --dataset-name sharegpt but did not provide --dataset-path. The ShareGPT dataset requires a path to a JSON file (e.g., ShareGPT_V3_unfiltered_cleaned_split.json). You must download the dataset and specify its path:
python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32 --model google/gemma-3-12b-it --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
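If you do want the ShareGPT file, it is usually fetched from Hugging Face; the URL below is the one commonly referenced in the vLLM benchmarking docs, so verify it first (and use curl -L -o instead if wget is not present in the image):
# assumed download URL and target path; verify before relying on it
wget -O /root/ShareGPT_V3_unfiltered_cleaned_split.json https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json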
Alternatively, to run a synthetic benchmark without a dataset file, use --dataset-name random (no --dataset-path needed). The script is deprecated; the recommended command is now vllm bench serve, as described in the official docs.
Would you like a full example using vllm bench serve?
How do I run the benchmark when the Docker container and vLLM are already running and the model is taking all the VRAM?
You can run the benchmark directly inside the same Docker container where vLLM is serving the model, even if the model is using all GPU VRAM. The benchmark script (vllm bench serve or python3 benchmarks/benchmark_serving.py) only sends HTTP requests to the running vLLM server and does not load the model itself, so it does not require additional GPU memory. Just ensure you have the required Python dependencies installed (e.g., pandas, datasets).
Example (inside the container):
vllm bench serve --backend openai --base-url http://localhost:8000 --model <served_model_name> --num-prompts 100 --max-concurrency 32 --dataset-name random
This approach is documented in the official benchmarking guide and confirmed in GitHub Issue #13598.
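From the host, the same benchmark can be launched without an interactive shell by exec'ing into the already running container (container name taken from this thread; substitute your served model name):
# assumes the container name used earlier in this thread
docker exec -it vllm_download_and_serve10 vllm bench serve --backend openai --base-url http://localhost:8000 --model <served_model_name> --num-prompts 100 --max-concurrency 32 --dataset-name random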
Would you like a step-by-step example for your setup?
Trying to serve the local model for the benchmark while inside the vLLM Docker container:
vllm bench serve \
  --backend openai \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model vllm/gemma-3 \
  --num-prompts 100 \
  --max-concurrency 32 \
  --dataset-name random
OSError: vllm/gemma-3 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with hf auth login or by passing token=<your_token>
I run the Docker container with
-v /home/ubuntu/vllm_models:/root/.cache/huggingface
so the models are outside the container, on the host.
The error means vllm/gemma-3 is not a valid Hugging Face repo or a direct path to a local model directory. When serving a local model, you must provide the absolute path to the model directory containing config.json (e.g., /root/.cache/huggingface/gemma-3 or /root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>). Just mounting /home/ubuntu/vllm_models is not enough; you must use the correct path inside the container.
Example:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model /root/.cache/huggingface/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
If you downloaded the model with huggingface-cli, the path is usually /root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>. See this issue and this comment for details.
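To see which models are actually present in the mounted cache, listing the hub directory inside the container is a quick first check (the path matches the volume mount used in this thread):
ls /root/.cache/huggingface/hub/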
Would you like help finding the exact model path inside your container?
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/huggingface/gemma-3'. Use repo_type argument if needed.
This error means /root/.cache/huggingface/gemma-3 is not the correct directory; vLLM expects the path to the model snapshot directory that contains config.json, not just a parent folder. For Hugging Face-downloaded models, the correct path is usually:
/root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>
Make sure you point --model to this snapshot directory, not a higher-level folder. See GitHub Issue #9459 for details.
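A generic way to locate that snapshot directory from inside the container is to search for config.json under the cache (a shell sketch; the path follows the mount used in this thread):
find /root/.cache/huggingface/hub -name config.json 2>/dev/null
Each config.json printed sits directly inside a snapshot directory that can be passed to --model.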
Would you like help locating the exact snapshot path for your model?
Yes.
The model is already running. Why is it so hard to find the model path?
with this command:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
I get this:
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 396, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
This error usually means the --model argument in your benchmark command does not match the --served-model-name used when you started the vLLM server. The model name in your benchmark request must exactly match the name the server is serving under, not the Hugging Face repo or local path. For example, if you started the server with:
vllm serve /root/.cache/huggingface/hub/models--google--gemma-3-12b-it/snapshots/<hash> --served-model-name vllm/gemma-3
then your benchmark command should use:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model vllm/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
If the names do not match, you will get a "Bad Request" or "Not Found" error, as confirmed in GitHub Issue #19743 and GitHub Issue #19047.
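To see what the server is actually rejecting, you can send one request by hand to the same endpoint the benchmark uses (a debugging sketch; the model name assumes the served name used in this thread):
# standard OpenAI-compatible chat request against the running vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "vllm/gemma-3", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 16}'
The JSON error body returned is usually more specific than the benchmark's generic Bad Request message.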
Would you like step-by-step instructions to verify your served model name?
That is exactly what does not work. This is how I started vLLM:
docker run -it --gpus all --ipc=host --network host -p 8000:8000 -v /home/ubuntu/vllm_models:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN="" --name vllm_download_and_serve10 vllm/vllm-openai:latest --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 --max-model-len 2048 --max-num-seqs 128 --gpu-memory-utilization 0.96
and that works from outside the container.