Why use the devel image to build the final vLLM Docker image?

Using the CUDA “devel” image is necessary for building vLLM from source because it includes development tools such as nvcc (the CUDA compiler), which are absent from the “runtime” image. The “runtime” image is smaller but lacks the toolchain needed to compile vLLM and its dependencies, so using it as the build base breaks the build outright rather than merely shrinking the image. Once built, you could in principle copy only the runtime artifacts into a new image based on the “runtime” image (a multi-stage build), though this is not the default approach in vLLM’s official Dockerfiles. Inference performance is unaffected by the choice of base image as long as all required runtime libraries are present; this is a question of build compatibility, not runtime speed or efficiency. According to this discussion and this issue, the main driver of the large image size is the inclusion of CUDA, PyTorch, and multi-arch support, not the choice of base image alone.
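As a rough sketch of the multi-stage idea mentioned above: compile in a “devel” stage, then copy only the built wheel into a “runtime” stage. The image tags, package names, and paths below are illustrative assumptions, not vLLM’s official Dockerfile.

```dockerfile
# Build stage: the "devel" image ships nvcc and the CUDA build toolchain.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip git && rm -rf /var/lib/apt/lists/*
# Build a vLLM wheel from source into /wheels (tag/paths are illustrative).
RUN git clone https://github.com/vllm-project/vllm.git /src \
 && pip3 wheel --no-deps -w /wheels /src

# Runtime stage: the smaller "runtime" image has CUDA libraries but no nvcc.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && rm -rf /var/lib/apt/lists/*
COPY --from=build /wheels /wheels
# Installing the prebuilt wheel needs no compiler in this stage.
RUN pip3 install --no-cache-dir /wheels/*.whl
```

Note that the runtime stage still pulls in PyTorch and CUDA libraries as wheel dependencies, which is why the final image stays large even with this approach.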

Would you like more detail on how to use multi-stage builds to further reduce the image size?
