Running vllm bench serve from a CPU-only node

Hi all,

We are currently serving models via vLLM on several GPU nodes. Now I would like to benchmark these models using vllm bench serve from a CPU-only node. As far as I understand, this only sends requests to the specified endpoint, which might be located on a completely different GPU node.
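For reference, this is roughly the invocation I have in mind (host and model name are placeholders for our actual deployment, and the exact flags may differ slightly between vLLM versions):

```
# run on the CPU-only node; the model is served on a remote GPU node
vllm bench serve \
  --base-url http://gpu-node-01:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 100
```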

Unfortunately, my attempts fail: after running platform discovery, the command complains because the node does not have a GPU (CUDA).

Details:
```
DEBUG 08-06 12:43:13 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-06 12:43:13 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-06 12:43:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-06 12:43:13 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-06 12:43:13 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-06 12:43:13 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-06 12:43:13 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-06 12:43:13 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:177] Checking if Neuron platform is available.
INFO 08-06 12:43:13 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform
DEBUG 08-06 12:43:17 [utils.py:155] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 08-06 12:43:17 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-06 12:43:17 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-06 12:43:17 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-06 12:43:17 [config.py:2071] Disabled the custom all-reduce kernel because it is not supported on current platform.

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 57, in main
    cmd.subparser_init(subparsers).set_defaults(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 99, in subparser_init
    serve_parser = make_arg_parser(serve_parser)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/cli_args.py", line 266, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1717, in add_cli_args
    parser = EngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 906, in add_cli_args
    vllm_kwargs = get_kwargs(VllmConfig)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 285, in get_kwargs
    return copy.deepcopy(_compute_kwargs(cls))
                         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 189, in _compute_kwargs
    default = field.default_factory()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2413, in __post_init__
    raise RuntimeError(
RuntimeError: Failed to infer device type, please set the environment variable VLLM_LOGGING_LEVEL=DEBUG to turn on verbose logging to help debug the issue.
```

Is there any way I can make this work quickly? Or do I have to build a special CPU version by hand?

Running vllm bench serve directly on a CPU-only node fails because vLLM performs platform detection at startup and, without a GPU or CPU backend, ends up on "UnspecifiedPlatform". That leads to the device-type error before any requests can even be sent to the remote GPU server. This behavior is known and has been discussed in several issues: the benchmarks are implemented so that they initialize vLLM locally, even when they are only meant to act as a client, and a pure HTTP-client mode without the platform check is currently not provided. See, e.g., GitHub issue #17825 and the troubleshooting docs.
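If you want to double-check what vLLM resolves on the benchmark node, printing the detected platform is a quick sanity check (a sketch, assuming a recent vLLM version where current_platform is exposed under vllm.platforms):

```
# shows which platform class vLLM resolved; on this node it falls back to UnspecifiedPlatform
python -c "from vllm.platforms import current_platform; print(current_platform)"
```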

According to the documentation and the linked issues, one workaround is to install vLLM with the CPU backend on the benchmark node (see the CPU installation guide) so that platform detection succeeds. Then vllm bench serve can be used as a client to send requests to the remote GPU server; see the sketch below. A pure benchmark client without a vLLM backend is currently not officially supported.
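Roughly, the workflow on the benchmark node would look like this (a sketch only; the requirements file name and build command depend on your vLLM version, so follow the CPU installation guide for your release, and the host/model names are placeholders):

```
# install vLLM with the CPU backend on the CPU-only benchmark node
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
VLLM_TARGET_DEVICE=cpu pip install . --no-build-isolation

# then use vllm bench serve purely as a client against the remote GPU endpoint
vllm bench serve \
  --base-url http://gpu-node-01:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 100
```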

Let me know if you would like a more detailed step-by-step guide for the CPU installation and the benchmarking workflow.
