Hi,
How should I benchmark a vLLM Docker setup with 2 GPUs on Ubuntu 24.04? I want to know how many concurrent users it can handle for a given prompt.
To benchmark your vLLM Docker setup with 2 GPUs and measure how many concurrent users it can handle for a given prompt, use the built-in benchmark scripts included in the vLLM Docker image. Start the vLLM server with your model and desired parallelism (e.g., --tensor-parallel-size 2), then run the benchmark_serving.py script from inside the container to simulate concurrent requests and measure throughput, latency, and concurrency. You can control the number of concurrent requests using the --max-concurrency flag in the benchmark script, which was added for this purpose. For example:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
This will report metrics such as request throughput, token throughput, and latency, helping you determine the maximum concurrency your setup can handle for your prompt size. For more details, see the official benchmarking documentation and max concurrency support. Benchmarks are included in the Docker image, so you do not need to clone the repo separately unless using a custom image.
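For reference, here is a minimal sketch of launching the server across both GPUs before benchmarking; the image tag, model name, served name, and volume mount are taken from later in this thread, so treat them as assumptions and adjust to your setup:
# sketch only: model, mount path, and served name follow the setup shown later in this thread
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 \
  --tensor-parallel-size 2 --max-model-len 2048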
Would you like a step-by-step example for your specific model and prompt?
ubuntu@ubuntuAIhome01:~$ docker exec -it vllm_download_and_serve10 bash
root@ubuntuAIhome01:/vllm-workspace# python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
bash: python: command not found
root@ubuntuAIhome01:/vllm-workspace# python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32
Traceback (most recent call last):
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 59, in <module>
    from benchmark_dataset import (
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 28, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
The error means the pandas Python package is not installed in your Docker container. To fix this, run:
pip install pandas
Then rerun your benchmark command. This is a common requirement, as the benchmark scripts depend on pandas for data handling, but it is not always pre-installed in the vLLM Docker image according to GitHub Issue #19078.
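Since python itself was not found in this container (only python3), pip may also be missing under that name; if so, the module form should work (an assumption based on python3 being present in the image):
# assumes python3 is on PATH, as shown above
python3 -m pip install pandas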
Would you like more troubleshooting steps or help with other missing dependencies?
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 59, in <module>
    from benchmark_dataset import (
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 29, in <module>
    from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'
You are missing the datasets package, which is required by the benchmark scripts. Install it with:
pip install datasets
This is expected, as benchmark dependencies like datasets are not always included by default in the vLLM Docker image, per GitHub Issue #19078.
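To avoid hitting these one at a time, the two packages that have failed so far can be installed in a single step (other benchmark dependencies may still turn up; this only covers the errors seen in this thread):
# installs only the two packages reported missing above
python3 -m pip install pandas datasets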
Would you like a list of all recommended benchmark dependencies?
INFO 09-15 00:05:56 [__init__.py:241] Automatically detected platform cuda.
usage: benchmark_serving.py [-h] [--backend {tgi,vllm,lmdeploy,deepspeed-mii,openai,openai-chat,openai-audio,tensorrt-llm,scalellm,sglang,llama.cpp}] [--base-url BASE_URL] [--host HOST] [--port PORT]
                            [--endpoint ENDPOINT] [--dataset-name {sharegpt,burstgpt,sonnet,random,hf,custom}] [--dataset-path DATASET_PATH] [--no-stream] [--max-concurrency MAX_CONCURRENCY] --model MODEL
                            [--tokenizer TOKENIZER] [--use-beam-search] [--num-prompts NUM_PROMPTS] [--logprobs LOGPROBS] [--request-rate REQUEST_RATE] [--burstiness BURSTINESS] [--seed SEED]
                            [--trust-remote-code] [--disable-tqdm] [--profile] [--save-result] [--save-detailed] [--append-result] [--metadata [KEY=VALUE ...]] [--result-dir RESULT_DIR]
                            [--result-filename RESULT_FILENAME] [--ignore-eos] [--percentile-metrics PERCENTILE_METRICS] [--metric-percentiles METRIC_PERCENTILES] [--goodput GOODPUT [GOODPUT ...]]
                            [--custom-output-len CUSTOM_OUTPUT_LEN] [--custom-skip-chat-template] [--sonnet-input-len SONNET_INPUT_LEN] [--sonnet-output-len SONNET_OUTPUT_LEN]
                            [--sonnet-prefix-len SONNET_PREFIX_LEN] [--sharegpt-output-len SHAREGPT_OUTPUT_LEN] [--random-input-len RANDOM_INPUT_LEN] [--random-output-len RANDOM_OUTPUT_LEN]
                            [--random-range-ratio RANDOM_RANGE_RATIO] [--random-prefix-len RANDOM_PREFIX_LEN] [--hf-subset HF_SUBSET] [--hf-split HF_SPLIT] [--hf-output-len HF_OUTPUT_LEN] [--top-p TOP_P]
                            [--top-k TOP_K] [--min-p MIN_P] [--temperature TEMPERATURE] [--tokenizer-mode {auto,slow,mistral,custom}] [--served-model-name SERVED_MODEL_NAME]
                            [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--ramp-up-strategy {linear,exponential}] [--ramp-up-start-rps RAMP_UP_START_RPS] [--ramp-up-end-rps RAMP_UP_END_RPS]
benchmark_serving.py: error: the following arguments are required: --model
The error means you must specify the --model argument when running benchmark_serving.py. For example:
python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32 --model <your_model_name>
Replace <your_model_name> with the name or path of your model. This is a required argument, as documented in the benchmarking API.
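To check which model name the running server actually expects, you can query its OpenAI-compatible model list (a standard vLLM endpoint; the port assumes the default 8000 used in this thread):
curl http://localhost:8000/v1/models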
Would you like help choosing the correct model name or path?
The model is already loaded into GPU VRAM and it's running in this same Docker container.
Anyway, this just does not work:
DeprecationWarning: benchmark_serving.py is deprecated and will be removed in a future version. Please use 'vllm bench serve' instead.
  main(args)
Namespace(backend='openai', base_url='http://localhost:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='sharegpt', dataset_path=None, no_stream=False, max_concurrency=32, model='google/gemma-3-12b-it', tokenizer=None, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None)
Traceback (most recent call last):
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 1305, in <module>
    main(args)
  File "/usr/local/lib/python3.12/dist-packages/typing_extensions.py", line 2956, in wrapper
    return arg(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 780, in main
    input_requests = dataset_mapping[args.dataset_name]()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_serving.py", line 759, in <lambda>
    "sharegpt": lambda: ShareGPTDataset(
                        ^^^^^^^^^^^^^^^^
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 384, in __init__
    self.load_data()
  File "/vllm-workspace/benchmarks/benchmark_dataset.py", line 388, in load_data
    raise ValueError("dataset_path must be provided for loading data.")
ValueError: dataset_path must be provided for loading data.
The error is because you set --dataset-name sharegpt but did not provide --dataset-path. The ShareGPT dataset requires a path to a JSON file (e.g., ShareGPT_V3_unfiltered_cleaned_split.json). You must download the dataset and specify its path:
python3 benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --num-prompts 100 --max-concurrency 32 --model google/gemma-3-12b-it --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
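If you do want the ShareGPT file, it is usually fetched from Hugging Face; the URL below is the one commonly referenced in the vLLM benchmarking docs, so verify it first (and use curl -L -o instead if wget is not present in the image):
# assumed download URL and target path; verify before relying on it
wget -O /root/ShareGPT_V3_unfiltered_cleaned_split.json https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json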
Alternatively, to run a synthetic benchmark without a dataset file, use --dataset-name random (no --dataset-path needed). The script is deprecated; the recommended command is now vllm bench serve, as described in the official docs.
Would you like a full example using vllm bench serve?
How do I run the benchmark when the Docker container and vLLM are already running and the model is taking all the VRAM?
You can run the benchmark directly inside the same Docker container where vLLM is serving the model, even if the model is using all GPU VRAM. The benchmark script (vllm bench serve or python3 benchmarks/benchmark_serving.py) only sends HTTP requests to the running vLLM server and does not load the model itself, so it does not require additional GPU memory. Just ensure you have the required Python dependencies installed (e.g., pandas, datasets).
Example (inside the container):
vllm bench serve --backend openai --base-url http://localhost:8000 --model <served_model_name> --num-prompts 100 --max-concurrency 32 --dataset-name random
This approach is documented in the official benchmarking guide and confirmed in GitHub Issue #13598.
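From the host, the same benchmark can be launched without an interactive shell by exec'ing into the already running container (container name taken from this thread; substitute your served model name):
# assumes the container name used earlier in this thread
docker exec -it vllm_download_and_serve10 vllm bench serve --backend openai --base-url http://localhost:8000 --model <served_model_name> --num-prompts 100 --max-concurrency 32 --dataset-name random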
Would you like a step-by-step example for your setup?
Trying to serve the local model for the benchmark while inside the vLLM Docker container:
vllm bench serve \
  --backend openai \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model vllm/gemma-3 \
  --num-prompts 100 \
  --max-concurrency 32 \
  --dataset-name random
OSError: vllm/gemma-3 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with hf auth login or by passing token=<your_token>
I run the Docker container with
-v /home/ubuntu/vllm_models:/root/.cache/huggingface
so the models are outside the container, on the host.
The error means vllm/gemma-3 is not a valid Hugging Face repo or a direct path to a local model directory. When serving a local model, you must provide the absolute path to the model directory containing config.json (e.g., /root/.cache/huggingface/gemma-3 or /root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>). Just mounting /home/ubuntu/vllm_models is not enough; you must use the correct path inside the container.
Example:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model /root/.cache/huggingface/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
If you downloaded the model with huggingface-cli, the path is usually /root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>. See this issue and this comment for details.
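To see which models are actually present in the mounted cache, listing the hub directory inside the container is a quick first check (the path matches the volume mount used in this thread):
ls /root/.cache/huggingface/hub/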
Would you like help finding the exact model path inside your container?
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/huggingface/gemma-3'. Use repo_type argument if needed.
This error means /root/.cache/huggingface/gemma-3 is not the correct directory; vLLM expects the path to the model snapshot directory that contains config.json, not just a parent folder. For Hugging Face-downloaded models, the correct path is usually:
/root/.cache/huggingface/hub/models--<org>--<model>/snapshots/<hash>
Make sure you point --model to this snapshot directory, not a higher-level folder. See GitHub Issue #9459 for details.
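A generic way to locate that snapshot directory from inside the container is to search for config.json under the cache (a shell sketch; the path follows the mount used in this thread):
find /root/.cache/huggingface/hub -name config.json 2>/dev/null
Each config.json printed sits directly inside a snapshot directory that can be passed to --model.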
Would you like help locating the exact snapshot path for your model?
Yes.
The model is already running. Why is it so hard to find the model path?
with this command:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
I get this:
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 396, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
This error usually means the --model argument in your benchmark command does not match the --served-model-name used when you started the vLLM server. The model name in your benchmark request must exactly match the name the server is serving under, not the Hugging Face repo or local path. For example, if you started the server with:
vllm serve /root/.cache/huggingface/hub/models--google--gemma-3-12b-it/snapshots/<hash> --served-model-name vllm/gemma-3
then your benchmark command should use:
vllm bench serve --backend openai --base-url http://localhost:8000 --endpoint /v1/chat/completions --model vllm/gemma-3 --num-prompts 100 --max-concurrency 32 --dataset-name random
If the names do not match, you will get a "Bad Request" or "Not Found" error, as confirmed in GitHub Issue #19743 and GitHub Issue #19047.
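To see what the server is actually rejecting, you can send one request by hand to the same endpoint the benchmark uses (a debugging sketch; the model name assumes the served name used in this thread):
# standard OpenAI-compatible chat request against the running vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "vllm/gemma-3", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 16}'
The JSON error body returned is usually more specific than the benchmark's generic Bad Request message.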
Would you like step-by-step instructions to verify your served model name?
That is exactly what does not work. This is how I started vLLM:
docker run -it --gpus all --ipc=host --network host -p 8000:8000 -v /home/ubuntu/vllm_models:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN="" --name vllm_download_and_serve10 vllm/vllm-openai:latest --model google/gemma-3-12b-it --served-model-name vllm/gemma-3 --max-model-len 2048 --max-num-seqs 128 --gpu-memory-utilization 0.96
and that works from outside the container.