I have been trying to reduce the size of the Docker image built from the Dockerfile on GitHub. From what I can tell, the final image is large because it is based on the devel image.
My question is: why don’t we use the runtime image instead of the devel image to reduce the image size? Would that affect performance or anything else?
Relevant discussion:
→ I have tried this method, but it only reduces the image size by about 100 MB.
Using the CUDA “devel” image is necessary for building vLLM from source because it includes development tools such as nvcc (the CUDA compiler) that are absent from the “runtime” image. The “runtime” image is smaller, but it lacks the tools required to compile vLLM and its dependencies, so switching to it would break the build rather than merely shrink the image. Once built, you could in principle copy only the runtime artifacts into a new image based on the “runtime” image (a multi-stage build), but this is not the default approach in vLLM’s official Dockerfiles. Inference performance is not affected by the choice of base image as long as all required runtime libraries are present; this is a question of build compatibility, not runtime speed or efficiency. According to this discussion and this issue, the main driver of the large image size is the inclusion of CUDA, PyTorch, and multi-arch support, not the choice of base image alone.
Would you like more detail on how to use multi-stage builds to further reduce the image size?
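For reference, a multi-stage build along these lines might look like the following. This is a hypothetical sketch, not vLLM’s official Dockerfile; the image tags, Python package path, and install commands are illustrative assumptions.

```dockerfile
# Hypothetical multi-stage sketch (NOT the official vLLM Dockerfile).
# Stage 1: build in the "devel" image, which ships nvcc and friends.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip
RUN pip install --no-cache-dir vllm   # extensions compile here with nvcc

# Stage 2: copy only the installed packages into the smaller "runtime" image.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3
# Path is an assumption for system Python 3.10 on Ubuntu 22.04.
COPY --from=build /usr/local/lib/python3.10/dist-packages \
                  /usr/local/lib/python3.10/dist-packages
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
```

Note the caveat raised elsewhere in this thread: because several vLLM dependencies JIT-compile kernels at runtime, the final stage may still need nvcc and gcc in practice, which is why this approach alone tends to yield only modest savings (on the order of the ~100 MB reported above).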
Various dependencies of vLLM (e.g. torch.compile, Triton, and FlashInfer) JIT-compile the actual device code that runs. This allows vLLM to optimize for the specific model and hardware profile without shipping a gigantic binary. Unfortunately, it also means that various “devel” dependencies such as nvcc and gcc must be available at runtime.
That being said, you are right that we could have a slimmer Dockerfile. There are many competing priorities for the team, so we would welcome any help to resolve this!
You are correct: vLLM and its dependencies (like torch.compile, Triton, and FlashInfer) require JIT compilation, which means development tools such as nvcc and gcc must be present at runtime for kernel compilation and auto-tuning. This is why the “devel” image is used, as confirmed in this discussion and this GitHub issue. While a slimmer Dockerfile is possible, removing these tools would break JIT compilation unless all kernels are precompiled for every possible hardware configuration, which is not practical.
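To illustrate the runtime requirement, a check like the following could be run inside a container to see whether the JIT toolchain is present on PATH. The tool list and the helper name `missing_tools` are my own illustrative assumptions, not part of vLLM.

```python
import shutil

# Hypothetical checklist of build tools that JIT paths (Triton,
# torch.compile, FlashInfer) may invoke at runtime; the exact set
# depends on which backends are enabled.
REQUIRED_TOOLS = {
    "gcc": "host compiler used by Triton / torch.compile",
    "nvcc": "CUDA compiler, shipped only in the 'devel' image",
}

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools not found on PATH."""
    return [name for name in tools if which(name) is None]

# Example: simulate a 'runtime' image where gcc exists but nvcc does not.
fake_path = {"gcc": "/usr/bin/gcc"}
print(missing_tools(which=fake_path.get))  # prints ['nvcc']
```

In a container built on the “runtime” image, a check like this would typically flag nvcc as missing, and JIT compilation of CUDA kernels would fail at that point.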
Would you like more detail on multi-stage builds or community efforts to slim down the image?