Current vLLM Docker image size is 12.64 GB; how can I reduce it?

The current vllm-openai Docker image is 12.64 GB, which is quite big.

I am wondering if there is any way to reduce the image size, e.g. by targeting a specific environment (a particular GPU such as the NVIDIA A100, or a single architecture) or by removing dependencies that are not needed in production.

The vLLM Docker image is large mainly because of bundled dependencies: PyTorch, the CUDA libraries, and precompiled kernels for multiple GPU architectures. To reduce the size, you can build a custom image that targets only your GPU architecture (e.g., compute capability 8.0 for the A100) by passing --build-arg torch_cuda_arch_list="8.0" at build time, and prune optional or development dependencies. Alternatively, start from a slim CUDA runtime base image and install only the production dependencies you need, or use a multi-stage build to keep build-time toolchains out of the final runtime layers. The official docs cover custom Docker builds, pruning tips, and example Dockerfiles.
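For example, an architecture-restricted build from a vLLM source checkout might look like this (the image tag is illustrative; torch_cuda_arch_list is a build arg accepted by vLLM's official Dockerfile). The command is shown as a dry run so it is safe to paste anywhere; drop the echo and run it from a vLLM checkout to actually build:

```shell
# Compile CUDA kernels only for compute capability 8.0 (A100).
# The default build covers many architectures, which inflates the image.
ARCH_LIST="8.0"

# Dry run: prints the build command instead of executing it.
echo DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai:a100-only \
    --build-arg torch_cuda_arch_list="$ARCH_LIST"
```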

Would you like a step-by-step example Dockerfile for a minimal, A100-only production image?

yes, a minimal A100 only production image would be ideal. Do you have an estimate on how much would that reduce image size?

Targeting only the A100 (compute capability 8.0) and removing optional/dev dependencies can likely cut several GB, potentially bringing the image down to 6–8 GB, though the exact savings depend on the base image and which dependencies you exclude. Most of the size comes from CUDA, PyTorch, and multi-architecture kernel support, so limiting the build to a single architecture helps significantly, but getting below 5 GB is considered unrealistic given the size of the core libraries (discussion, docs).
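To see which layers actually dominate on your machine, you can inspect the per-layer sizes of the published image (image name assumed). Shown as a dry run; drop the echo to run it for real, which requires Docker and the image pulled locally:

```shell
# Print each layer of the image with its size, so you can see which
# build steps (CUDA libs, PyTorch wheels, apt caches, ...) dominate.
IMAGE="vllm/vllm-openai:latest"
echo docker history --no-trunc --human "$IMAGE"
```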

Here’s a minimal example Dockerfile for A100-only production use:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and pip
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Limit CUDA kernel compilation to the A100 (compute capability 8.0).
# Note: this only takes effect when vLLM is compiled from source; the
# prebuilt PyPI wheel ships kernels for all supported architectures.
ENV TORCH_CUDA_ARCH_LIST="8.0"

# Install only vLLM and its production dependencies
RUN pip3 install --no-cache-dir vllm

# Expose the OpenAI-compatible API port
EXPOSE 8000

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
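Assuming the Dockerfile above is saved as Dockerfile in an empty directory, building and serving a model might look like this (the tag and model name are illustrative). Shown as a dry run; drop the echo prefixes to execute for real, which requires Docker plus the NVIDIA Container Toolkit for --gpus:

```shell
# Build the minimal image, then run the OpenAI-compatible server;
# arguments after the image name are passed to the API server entrypoint.
TAG="vllm-a100:latest"
echo docker build -t "$TAG" .
echo docker run --gpus all -p 8000:8000 "$TAG" --model meta-llama/Llama-2-7b-hf
```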

For even more reduction, build vLLM from source with --build-arg torch_cuda_arch_list="8.0" so that only A100 kernels are compiled (docs). Would you like a more advanced multi-stage build example?
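As a preview, a multi-stage source build could be sketched like this. The stage names, checkout path, and wheel directory are assumptions rather than the official vLLM Dockerfile layout, and a real build will need additional build tooling (cmake, ninja, a matching torch):

```dockerfile
# --- Build stage: compile vLLM with A100-only kernels ---
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip python3-dev git && \
    rm -rf /var/lib/apt/lists/*
ENV TORCH_CUDA_ARCH_LIST="8.0"
RUN git clone https://github.com/vllm-project/vllm.git /src && \
    pip3 wheel --no-cache-dir --wheel-dir /wheels /src

# --- Runtime stage: only the CUDA runtime plus the built wheels ---
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
COPY --from=build /wheels /wheels
RUN pip3 install --no-cache-dir /wheels/*.whl && rm -rf /wheels
EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
```

The point of the split is that compilers, headers, and the git checkout stay in the build stage, so the final image carries only the CUDA runtime, Python, and the installed wheels.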
