Your vLLM server is running with --served-model-name vllm/gemma-3, so your benchmark command must use --model vllm/gemma-3 (not the Hugging Face name). Also, make sure you use standard dashes (-), not typographic dashes (–), in your command—otherwise, arguments may not be parsed correctly.
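To confirm which name the server actually exposes, you can query its OpenAI-compatible models endpoint (assuming the default port 8000):

curl http://localhost:8000/v1/models

The id field in the response should read vllm/gemma-3; that exact string is what --model must match.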
If you still get “Bad Request,” double-check for any invisible Unicode characters or copy-paste issues in your command. This exact approach is confirmed to work in GitHub Issue #19743.
Would you like help checking your command for hidden character issues?
You seem not to understand; with that command it won't work:
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 54, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/serve.py", line 21, in cmd
    main(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 959, in main
    return asyncio.run(main_async(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 1002, in main_async
    tokenizer = get_tokenizer(tokenizer_id,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer.py", line 239, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1049, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 881, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 321, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 511, in cached_files
    raise OSError(
OSError: vllm/gemma-3 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with hf auth login or by passing token=<your_token>
Thank you for the detailed traceback. The root cause: the benchmark client tries to load a tokenizer from the --model value (vllm/gemma-3), which is neither a valid Hugging Face repo ID nor a local directory. The benchmark script needs a resolvable tokenizer path, not just the served model name.
Solution:
You must explicitly provide the tokenizer path using --tokenizer, pointing to the local directory containing tokenizer_config.json (usually the Hugging Face snapshot directory). For example:
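(This sketch assumes you are invoking the benchmark via vllm bench serve, which matches your traceback; the snapshot path is a placeholder to replace with the directory found by the command below, and any other flags you were already passing stay the same.)

vllm bench serve \
  --model vllm/gemma-3 \
  --tokenizer /root/.cache/huggingface/hub/models--google--gemma-3-12b-it/snapshots/<snapshot_hash>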
To automatically find the correct tokenizer path inside your container, run this command in your Docker shell:
find /root/.cache/huggingface/hub/models--google--gemma-3-12b-it -type d -name "snapshots" -exec find {} -type d \; | while read d; do if [ -f "$d/tokenizer_config.json" ]; then echo "$d"; fi; done
This will print the full path to the snapshot directory containing tokenizer_config.json, which you should use as the --tokenizer argument in your benchmark command. This approach is based on the Hugging Face cache structure described in GitHub Issue #9459.
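If you prefer a single step, you could capture the path in a shell variable and sanity-check it (a sketch; it assumes only one snapshot is present and simply takes the first match):

TOKENIZER_DIR=$(dirname "$(find /root/.cache/huggingface/hub/models--google--gemma-3-12b-it -name tokenizer_config.json | head -n 1)")
ls "$TOKENIZER_DIR"   # should show tokenizer_config.json alongside the other tokenizer files

You can then pass "$TOKENIZER_DIR" as the --tokenizer value.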
Would you like a ready-to-use benchmark command with the discovered path?
It never worked, and after half a day of trying the vLLM Docker benchmark with your instructions, as well as OpenAI's and Gemini's, it still does not work. Maybe the reason is that the model is hosted outside Docker, on my local computer.
Thank you for trying; I will never try again. Waste of time.
I'm sorry for the frustration; vLLM benchmarking with Docker and local models can be tricky because of path and environment mismatches. Mounting the model directory into Docker should work, but the path inside the container must exactly match what vLLM expects (the snapshot directory containing config.json and tokenizer_config.json). Many users have run into the same problem, especially with local models and tokenizer paths, as documented in GitHub Issues #9459 and #8938.
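For reference, a rough sketch of that approach (assumptions: the model lives in the default Hugging Face cache on the host, the benchmark runs from the vllm/vllm-openai image, and the snapshot hash is a placeholder):

docker run --rm -it --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint /bin/bash \
  vllm/vllm-openai:latest

Then, inside the container shell:

vllm bench serve --model vllm/gemma-3 \
  --tokenizer /root/.cache/huggingface/hub/models--google--gemma-3-12b-it/snapshots/<snapshot_hash>

With the cache mounted this way, the snapshot path inside the container matches the host layout, which is exactly the condition described above. The --network host flag keeps localhost pointing at a server running on the host; adjust if your networking differs.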
If you ever reconsider, I recommend:
Verifying the exact snapshot path inside the container.
Explicitly passing --tokenizer with the full path.
Using only standard dashes (-) in all commands.
Thank you for your feedback, and sorry again for the inconvenience.