Running vllm bench serve from a CPU-only node

Hi all,

We are currently serving models via vLLM on several GPU nodes. Now I would like to benchmark these models using vllm bench serve from a CPU-only node. As far as I understand, this only sends requests to the specified endpoint, which might be located on a completely different GPU node.
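For reference, this is roughly the invocation I have in mind (host and model name are placeholders for our actual deployment, and the exact flags may differ slightly between vLLM versions):

```
# run on the CPU-only node; the model is served on a remote GPU node
vllm bench serve \
  --base-url http://gpu-node-01:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 100
```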

Unfortunately, my attempts fail: after running platform discovery, the command complains because the node does not have a GPU (CUDA).

Details:
```
DEBUG 08-06 12:43:13 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-06 12:43:13 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-06 12:43:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-06 12:43:13 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-06 12:43:13 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-06 12:43:13 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-06 12:43:13 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-06 12:43:13 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:177] Checking if Neuron platform is available.
INFO 08-06 12:43:13 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform
DEBUG 08-06 12:43:17 [utils.py:155] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 08-06 12:43:17 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-06 12:43:17 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-06 12:43:17 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-06 12:43:17 [config.py:2071] Disabled the custom all-reduce kernel because it is not supported on current platform.

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 57, in main
    cmd.subparser_init(subparsers).set_defaults(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 99, in subparser_init
    serve_parser = make_arg_parser(serve_parser)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/cli_args.py", line 266, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1717, in add_cli_args
    parser = EngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 906, in add_cli_args
    vllm_kwargs = get_kwargs(VllmConfig)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 285, in get_kwargs
    return copy.deepcopy(_compute_kwargs(cls))
                         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 189, in _compute_kwargs
    default = field.default_factory()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2413, in __post_init__
    raise RuntimeError(
RuntimeError: Failed to infer device type, please set the environment variable VLLM_LOGGING_LEVEL=DEBUG to turn on verbose logging to help debug the issue.
```

Is there any way I can make this work quickly? Or do I have to build a special CPU version by hand?

Running vllm bench serve directly on a CPU-only node fails because vLLM performs platform detection at startup and, without a GPU or CPU backend, ends up on "UnspecifiedPlatform". That leads to the device-type error before any requests can even be sent to the remote GPU server. This behavior is known and has been discussed in several issues: the benchmarks are implemented so that they initialize vLLM locally, even when they are only meant to act as a client, and a pure HTTP-client mode without the platform check is currently not provided. See, e.g., GitHub issue #17825 and the troubleshooting docs.
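If you want to double-check what vLLM resolves on the benchmark node, printing the detected platform is a quick sanity check (a sketch, assuming a recent vLLM version where current_platform is exposed under vllm.platforms):

```
# shows which platform class vLLM resolved; on this node it falls back to UnspecifiedPlatform
python -c "from vllm.platforms import current_platform; print(current_platform)"
```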

According to the documentation and the linked issues, one workaround is to install vLLM with the CPU backend on the benchmark node (see the CPU installation guide) so that platform detection succeeds. Then vllm bench serve can be used as a client to send requests to the remote GPU server; see the sketch below. A pure benchmark client without a vLLM backend is currently not officially supported.
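Roughly, the workflow on the benchmark node would look like this (a sketch only; the requirements file name and build command depend on your vLLM version, so follow the CPU installation guide for your release, and the host/model names are placeholders):

```
# install vLLM with the CPU backend on the CPU-only benchmark node
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
VLLM_TARGET_DEVICE=cpu pip install . --no-build-isolation

# then use vllm bench serve purely as a client against the remote GPU endpoint
vllm bench serve \
  --base-url http://gpu-node-01:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 100
```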

Let me know if you would like a more detailed step-by-step guide for the CPU installation and the benchmarking workflow.
