vLLM Docker image for Mac (Apple Silicon) CPU

Hello everyone,

I want to run vLLM on my MacBook Pro (M2 Pro).

According to the official documentation, I have to build vLLM from source to run it on the Apple M2 Pro CPU.

However, my macOS version is too old to build from source, so I would like to download a vLLM Docker image for the M2 Pro CPU and run vLLM in a Docker container instead.

But I could not find such an image on Docker Hub.

So I am asking for help here: where can I download a vLLM Docker image for a Mac M2 Pro?

There is currently no official vLLM Docker image for Mac (Apple Silicon) CPU on Docker Hub. The official vLLM images on Docker Hub (vllm/vllm-openai) are built for Linux/amd64 and require NVIDIA GPUs; they do not support Apple Silicon CPUs out of the box. For Apple Silicon (M1/M2/M3), the official documentation and maintainers state you must build vLLM from source on your Mac, as there are no pre-built wheels or images for this device (docs, github_file).

There are some unofficial, community-built images (e.g., on Docker Hub by individual users), but there is no formal, supported vLLM CPU image for the Mac M2 Pro. The only official CPU image is published to AWS ECR (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.7.2), but this is for Linux/amd64, not Mac ARM (github_issue). If you want to run vLLM on your Mac M2 Pro, you must upgrade your OS to meet the build requirements and build from source, or attempt to build a Docker image yourself using the provided Dockerfiles and cross-compilation instructions (docs, github_file).
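For reference, here is a minimal sketch of the from-source route on Apple Silicon, assuming a supported macOS version with the Xcode command line tools and a recent Python; the requirements file name may differ between vLLM versions:

# build the CPU backend from source on Apple Silicon (sketch; file names may vary by version)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
pip install -e .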

Would you like more detail on how to build vLLM from source or create a custom Docker image for Apple Silicon?

I’m also facing a similar issue.
I’m planning to build a Docker image today and publish it on Docker Hub. I will share it here once I’m done. Hope that helps.

There is currently no official vLLM Docker image for Mac (Apple Silicon) CPUs on Docker Hub, and the maintainers recommend building from source for Apple Silicon (docs). If you successfully build and publish a working Docker image for Mac M2/M3 CPUs, it would be helpful to share the Dockerfile and image link with the community, as others have requested this but no official or widely shared solution exists (github_issue).

Would you like step-by-step guidance on building such a Docker image for Apple Silicon?

Please tell me the image name after pushing it to Docker Hub, thank you! :heart_eyes:

Sure thing @Pigcanflysohigh! I’ve been trying to build the image locally, but it’s failing at my end as well. I will try again on a bigger machine this week.

Will let you know once I’m done.

The first time, I built it with the command: docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env .
It failed, and the output was: ERROR: failed to solve: ResourceExhausted: process "/bin/sh -c VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel" did not complete successfully: cannot allocate memory

The second time, I built it with: docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env . --build-arg MAKEFLAGS="-j1"
but I got the same output as the first time.

I asked GPT about the cause, and it said there is not enough memory.

Your error, ResourceExhausted: ... cannot allocate memory, means your system ran out of RAM during the vLLM Docker build. This is a common issue, especially on machines with limited memory, as vLLM’s build process is memory-intensive (github_issue, github_issue, docs).

To resolve this, try adding swap space or reducing parallelism by setting MAX_JOBS=1 (not MAKEFLAGS), e.g.:

export MAX_JOBS=1   # limit compile parallelism to reduce peak memory use
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env .
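The build also runs inside the Docker Desktop VM, so it can only use the memory allocated to that VM in Docker Desktop's Resources settings. A quick way to see the current limit (assuming Docker Desktop is running):

# the "Total Memory" value is the RAM available to Docker builds; raise it in Docker Desktop if it is small
docker info | grep -i 'total memory'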

Alternatively, add swap to your system or build on a machine with more RAM (github_issue). Would you like more detailed troubleshooting steps?

Thank you,

I have built the image successfully; it is about 2.54 GB.

Great to hear you built the image successfully! A size of about 2.54 GB is typical for a vLLM CPU Docker image (github_pull_request). If you plan to share it, consider pushing it to Docker Hub for others who need Apple Silicon support.
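If you do share it, here is a minimal sketch of tagging and pushing the local image (yourusername/vllm-cpu-apple:0.1 is only a placeholder, not an existing repository):

# log in, tag the local image under your Docker Hub namespace, and push it
docker login
docker tag vllm-cpu-env yourusername/vllm-cpu-apple:0.1
docker push yourusername/vllm-cpu-apple:0.1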

Would you like tips on running vLLM in the container or optimizing the image further?

I still have a question about this argument:

-e VLLM_CPU_OMP_THREADS_BIND=

Apple Silicon is a single chip that contains both a CPU and a GPU. Does vLLM run on the CPU or the GPU, and what is the reason?

On Apple Silicon (M1/M2/M3), vLLM only supports running on the CPU, not the GPU. The environment variable VLLM_CPU_OMP_THREADS_BIND is used to control which CPU cores vLLM’s OpenMP threads are bound to; it does not affect GPU usage (docs, docs).

The reason is that vLLM does not support Apple’s Metal (MPS) GPU backend—only CPU inference is available on Mac. All computation is performed on the CPU, and the GPU is not used (github_issue, github_pull_request).
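To decide which cores to bind, it helps to know how many cores the chip has. On recent macOS versions you can check from the host like this (the perflevel sysctl keys are specific to Apple Silicon):

# total logical CPUs, plus the performance/efficiency core split
sysctl -n hw.ncpu
sysctl -n hw.perflevel0.physicalcpu   # performance cores
sysctl -n hw.perflevel1.physicalcpu   # efficiency cores

Note that inside the container the visible CPUs are those assigned to the Docker Desktop VM, so the core IDs in VLLM_CPU_OMP_THREADS_BIND refer to those virtual CPUs, not directly to the physical cores.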

Would you like more detail on thread binding or vLLM’s device support?

I have run into another problem.

docker run --rm \
  --privileged=true \
  --shm-size=4g \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
  -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
  vllm-cpu-env \
  --model=meta-llama/Llama-3.2-1B-Instruct \
  --dtype=bfloat16 \
  <other vLLM OpenAI server arguments>

This is how the official documentation says to run the container on my Mac, but I am not sure how to configure the two arguments VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> and VLLM_CPU_KVCACHE_SPACE=<KV cache space>.

Can you give an example?

For VLLM_CPU_OMP_THREADS_BIND, should I write every core or the number of cores?

What is the unit of VLLM_CPU_KVCACHE_SPACE? GB, MB, or KB?

For Apple Silicon (or any CPU backend), here’s how to set those variables:

  • VLLM_CPU_OMP_THREADS_BIND should be a list or range of CPU core IDs (not just a number). For example, to use cores 0–7:
    -e VLLM_CPU_OMP_THREADS_BIND=0-7
    This binds 8 OpenMP threads to logical CPU cores 0 through 7 (docs).

  • VLLM_CPU_KVCACHE_SPACE sets the KV cache size in GiB (not MB or KB). For example, to allocate 4 GiB:
    -e VLLM_CPU_KVCACHE_SPACE=4
    (docs)

Example full command:

docker run --rm \
  --privileged=true \
  --shm-size=4g \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  vllm-cpu-env \
  --model=meta-llama/Llama-3.2-1B-Instruct \
  --dtype=bfloat16
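Once the server is up, a quick way to confirm it is serving requests (a sketch; the model name must match the --model value you started the server with):

# list the served models, then send a small chat completion request
curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'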

Would you like more detail on how to choose the best values for your hardware?

What about the following error logs?

NotImplementedError: Could not run '_C::onednn_mm' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit ``https://fburl.com/ptmfixes`` for possible resolutions. '_C::onednn_mm' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMTIA, AutogradMAIA, AutogradMeta, Tracer, AutocastCPU, AutocastMTIA, AutocastMAIA, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

This error means the vLLM build is missing the required oneDNN (DNNL) backend for the _C::onednn_mm operator, so the operator cannot run on your CPU. This usually happens if oneDNN was not installed or not properly linked during the vLLM build process (github_issue, github_issue).

To fix:

  • Install oneDNN (DNNL) and its development headers before building vLLM, e.g.
    sudo apt-get install -y libdnnl-dev
  • Then rebuild vLLM from source so it can find and use oneDNN.

Would you like more detailed build instructions for your platform?

First question: my system is macOS, not Ubuntu. Should I use apt-get to install oneDNN?

Second question: I have already built vLLM from source on my computer. If I install oneDNN, can I build vLLM again?

Third question: how do I install oneDNN on macOS, and how can I check whether it was installed successfully?

  1. On macOS, you cannot use apt-get. Instead, use Homebrew to install oneDNN:
brew install onednn

(oneDNN Homebrew formula)

  2. Yes, after installing oneDNN, you should rebuild vLLM from source so it can detect and link to oneDNN (github_issue).

  3. To check if oneDNN is installed, run:

brew list onednn

You should see the installed files. You can also check the include and lib directories, e.g., /opt/homebrew/include/dnnl.h and /opt/homebrew/lib/libdnnl.dylib.
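A quick way to check both the formula and the files it installed (paths assume the default Apple Silicon Homebrew prefix /opt/homebrew):

# confirm the Homebrew formula version and the headers/library it installed
brew list --versions onednn
ls /opt/homebrew/include/dnnl.h
ls /opt/homebrew/lib/libdnnl*.dylib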

Would you like step-by-step rebuild instructions for vLLM on macOS?

I installed oneDNN with brew successfully, built vLLM again, and then built a new Docker image. But when I run the container again, I get the same error logs:

NotImplementedError: Could not run '_C::onednn_mm' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit ``https://fburl.com/ptmfixes`` for possible resolutions. '_C::onednn_mm' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMTIA, AutogradMAIA, AutogradMeta, Tracer, AutocastCPU, AutocastMTIA, AutocastMAIA, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

On macOS, vLLM does not officially support the oneDNN backend, and the _C::onednn_mm operator is not built for the Apple Silicon CPU backend—even if you install oneDNN with Homebrew (github_issue, docs). The vLLM build system is designed for oneDNN on Linux/x86, not macOS, so the operator is omitted during compilation.

This means you will continue to see this error on Mac, and there is currently no supported workaround to enable oneDNN or the _C::onednn_mm operator on Apple Silicon. vLLM on Mac only supports the basic CPU backend, not the optimized oneDNN path (github_issue).
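If you want to confirm this from inside the environment where your vLLM build runs, here is a hedged check; it relies on a private PyTorch helper (torch._C._dispatch_has_kernel_for_dispatch_key) and assumes import vllm succeeds in that environment:

# ask PyTorch whether a CPU kernel was registered for the failing custom op;
# in an affected build this is expected to print False
python3 -c "import vllm, torch; print(torch._C._dispatch_has_kernel_for_dispatch_key('_C::onednn_mm', 'CPU'))"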

Would you like more detail on supported features or possible alternatives?
