Hi all,
We are currently serving models via vLLM on several GPU nodes. Now I would like to benchmark these models using `vllm bench serve`
from a CPU-only node. As far as I understand, this command only sends requests to the specified endpoint, which may be located on a completely different GPU node.
Unfortunately, my attempts fail: after running platform discovery, the command complains because the node has no GPU (CUDA).
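For reference, the invocation looks roughly like this (host, port, and model name are placeholders for our actual setup):
```
# run on the CPU-only node, pointing at the remote GPU node serving the model
vllm bench serve \
  --host gpu-node.example.com \
  --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 100
```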
Details:
```
DEBUG 08-06 12:43:13 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-06 12:43:13 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-06 12:43:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-06 12:43:13 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 08-06 12:43:13 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-06 12:43:13 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-06 12:43:13 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-06 12:43:13 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-06 12:43:13 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-06 12:43:13 [__init__.py:177] Checking if Neuron platform is available.
INFO 08-06 12:43:13 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform
DEBUG 08-06 12:43:17 [utils.py:155] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 08-06 12:43:17 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-06 12:43:17 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-06 12:43:17 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-06 12:43:17 [config.py:2071] Disabled the custom all-reduce kernel because it is not supported on current platform.
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 57, in main
    cmd.subparser_init(subparsers).set_defaults(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 99, in subparser_init
    serve_parser = make_arg_parser(serve_parser)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/cli_args.py", line 266, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1717, in add_cli_args
    parser = EngineArgs.add_cli_args(parser)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 906, in add_cli_args
    vllm_kwargs = get_kwargs(VllmConfig)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 285, in get_kwargs
    return copy.deepcopy(_compute_kwargs(cls))
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 189, in _compute_kwargs
    default = field.default_factory()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2413, in __post_init__
    raise RuntimeError(
RuntimeError: Failed to infer device type, please set the environment variable VLLM_LOGGING_LEVEL=DEBUG
to turn on verbose logging to help debug the issue.
```
Is there any way to make this work quickly, or do I have to build a special CPU version of vLLM by hand?
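For completeness: a crude fallback would be to time raw requests against the OpenAI-compatible API myself, e.g. with curl (host and model are placeholders again), but I would much rather use the official benchmark so the metrics stay comparable:
```
# manual smoke test from the CPU node against the remote endpoint
curl -s http://gpu-node.example.com:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```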