Does vLLM support inference or starting a server for a small model on CPU?

vllm serve /dfs/data/model/bge-m3/ --task embed --trust-remote-code --max-model-len 8192

Starting with the above command in a CPU-only environment results in the following error:

Traceback (most recent call last):
  File "/dfs/data/miniconda/envs/Qwen2-New/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1098, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "<string>", line 4, in __init__
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/config.py", line 2119, in __post_init__
    raise RuntimeError(
RuntimeError: Failed to infer device type, please set the environment variable VLLM_LOGGING_LEVEL=DEBUG to turn on verbose logging to help debug the issue.

Adding --device still results in an error:
Traceback (most recent call last):
  File "/dfs/data/miniconda/envs/Qwen2-New/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1273, in create_engine_config
    config = VllmConfig(
  File "<string>", line 19, in __init__
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/config.py", line 3848, in __post_init__
    self.model_config.verify_async_output_proc(self.parallel_config,
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/config.py", line 902, in verify_async_output_proc
    if not current_platform.is_async_output_supported(self.enforce_eager):
  File "/dfs/data/miniconda/envs/Qwen2-New/lib/python3.10/site-packages/vllm/platforms/interface.py", line 214, in is_async_output_supported
    raise NotImplementedError
NotImplementedError

So, does vLLM support inference or starting a server for a small model like this on CPU?

vLLM's CPU inference support is currently limited. There are no pre-built CPU wheels or official CPU images, so you have to compile the CPU backend from source to run on CPU, and some features and models may still be incomplete there. The "Failed to infer device type" and NotImplementedError errors you are seeing are exactly what happens when the CPU backend has not been built or platform detection fails. You need to build and install from source with VLLM_TARGET_DEVICE=cpu python setup.py install; a plain pip install or the official Docker image will not start the server in a CPU-only environment. The official documentation and related issues cover this in detail.
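For reference, a rough outline of that source build, paraphrasing the official CPU installation guide (the exact requirements file name and prerequisite packages differ between vLLM versions, so treat this as a sketch and check the docs for your release):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu python setup.py install

As a quick sanity check afterwards (an assumption based on the import path seen in your traceback, not an official verification step), python -c "from vllm.platforms import current_platform; print(current_platform)" should report a CPU platform on a working CPU build instead of failing to infer the device type.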

In addition, several issues report that even with the --device cpu flag, a vLLM that was not built from source still fails. It is best to follow the official CPU installation documentation exactly and make sure the dependencies and build environment meet its requirements. If you have already built from source and still hit errors, set the environment variable VLLM_LOGGING_LEVEL=DEBUG to get more detailed logs for troubleshooting. The official Troubleshooting documentation covers this as well.
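For example, rerunning the same command from above with verbose logging enabled (same model path and flags; only the environment variable is new):

VLLM_LOGGING_LEVEL=DEBUG vllm serve /dfs/data/model/bge-m3/ --task embed --trust-remote-code --max-model-len 8192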

Do you need more detailed build steps or troubleshooting suggestions?

Okay, thank you for your answer.

You're welcome! If you want more details on building, deploying, or troubleshooting vLLM on CPU, feel free to ask anytime. Good luck running your model!