I’m trying to test out vLLM on a server, but when I start it up it fails because it can’t find nvcc, and once I fix that, it fails because it cannot find the cuda_runtime.h header file.
I only installed the compiler on the server as a quick troubleshooting step. Is there a straightforward way to compile those files ahead of time and copy them to the server? If so, what are the actual steps? I would like to avoid keeping unnecessary development tooling in production.
LLMs suggested that installing inferenceai-cubin (or some similarly named package) would short-circuit the code that delegates to the compilation process, i.e. it would use existing distributed binaries. But the information they gave me was a bit contradictory, because installing that package didn’t seem to change the behavior.
vLLM provides pre-built wheels with compiled CUDA binaries, so you do not need to compile CUDA code or have nvcc/CUDA headers on your production server if you use these wheels. The recommended approach is to install vLLM using the official pre-built wheels (e.g., via pip install vllm or uv pip install vllm), which contain all necessary binaries for inference, with no development tooling required on the server. If you build vLLM from source, then nvcc and CUDA headers are required, but this is not needed for standard inference deployments using the official wheels. The official documentation, installation instructions, troubleshooting guides, and issues do not mention an “inferenceai-cubin” package or anything similar as a supported solution for this use case.
If you are seeing build/compilation errors on the server, it likely means you are not using the pre-built wheels or are forcing a source build. Ensure you are installing vLLM from PyPI or the official wheel index, and not from source, to avoid these issues. For advanced scenarios (e.g., custom CUDA/PyTorch builds), you can build the wheel on a development machine and copy it to the server, but the standard workflow is to use the official wheels directly on production. See vLLM installation docs and pre-built wheels section for details.
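As a rough way to inspect how the installed package got onto the machine, the sketch below reads the installation metadata through Python's standard importlib.metadata module. It relies on the PEP 610 convention that installers write a direct_url.json file only when a package is installed from a local path, URL, or VCS checkout rather than from an index. The function name describe_install is made up for this illustration, and the result is only a hint: a missing direct_url.json does not prove that a binary wheel was used.

```python
import json
from importlib import metadata


def describe_install(package="vllm"):
    """Return basic facts about how `package` was installed, or None if absent."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    # direct_url.json (PEP 610) is only recorded for installs from a local
    # path, URL, or VCS checkout; index installs normally do not have it.
    raw = dist.read_text("direct_url.json")
    return {
        "version": dist.version,
        "installer": (dist.read_text("INSTALLER") or "").strip(),
        "direct_url": json.loads(raw) if raw else None,
    }


if __name__ == "__main__":
    # Assumption: the package of interest is named "vllm" in this environment.
    print(describe_install("vllm"))
```

If direct_url comes back populated with a local directory or a git URL, the environment was most likely built from source or from a locally built wheel, which would fit the compilation errors described above.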
Would you like more detail on how to ensure you are using the pre-built wheels and not triggering a source build?
The package I installed was probably flashinfer-cubin, or something like that.
Check the linked GitHub issue, review the official docs, and tell me the additional commands to use in the installation process, alongside those already provided for CUDA environments in the getting started section of the docs.
There is no officially documented or supported way in vLLM to avoid CUDA kernel compilation (and thus the need for nvcc and CUDA headers like cuda_runtime.h) by installing a package such as flashinfer-cubin or flashinfer-python-cubin. The official vLLM and FlashInfer documentation do not mention any such package or workflow to pre-bundle all required CUDA binaries for FlashInfer, nor do they provide a way to skip JIT compilation for all kernels via a pip package alone. The error you are seeing (fatal error: cuda_runtime.h: No such file or directory) means FlashInfer is still trying to JIT-compile CUDA code at runtime and cannot find the necessary headers, which is expected if you do not have the full CUDA toolkit installed on the server (see this issue).
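To confirm what a JIT compile step would or would not find on the server, a small diagnostic like the following can help. It is only a sketch: the function name find_cuda_toolchain is hypothetical, and the lookup assumes the conventional CUDA_HOME environment variable with /usr/local/cuda as a fallback, which may not match how FlashInfer or vLLM actually locate the toolkit.

```python
import os
import shutil
from pathlib import Path


def find_cuda_toolchain(cuda_home=None):
    """Report whether nvcc and cuda_runtime.h are visible on this machine."""
    # Assumption: CUDA_HOME or /usr/local/cuda is where the toolkit would live.
    root = Path(cuda_home or os.environ.get("CUDA_HOME", "/usr/local/cuda"))
    header = root / "include" / "cuda_runtime.h"
    return {
        "nvcc": shutil.which("nvcc"),  # None when nvcc is not on PATH
        "cuda_root": str(root),
        "cuda_runtime_h": str(header) if header.is_file() else None,
    }


if __name__ == "__main__":
    for key, value in find_cuda_toolchain().items():
        print(f"{key}: {value}")
```

On a production server without development tooling, both nvcc and cuda_runtime_h would be expected to come back as None, which matches the two failures reported in the question.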
The only officially supported ways to avoid this are:
Use the official vLLM Docker images, which come with all dependencies and CUDA headers pre-installed (see docs).
Install the full CUDA toolkit on your server so that JIT compilation can succeed.
Build all required wheels and JIT artifacts on a development machine with matching CUDA/PyTorch versions, then copy the entire .cache/flashinfer directory and your Python environment to the server—but this is not documented as a guaranteed or supported workflow in the official docs.
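For the third option, the copy step itself could be sketched with Python's standard tarfile module as below. The helper names pack_cache and unpack_cache are made up for this example, and the cache location ~/.cache/flashinfer is an assumption based on the directory named above. Whether the copied artifacts are actually picked up depends on matching GPU, CUDA, and PyTorch versions, which this sketch does not check.

```python
import tarfile
from pathlib import Path


def pack_cache(cache_dir, archive_path):
    """Archive a cache directory (run on the development machine)."""
    cache_dir = Path(cache_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name, e.g. "flashinfer/...".
        tar.add(cache_dir, arcname=cache_dir.name)
    return Path(archive_path)


def unpack_cache(archive_path, parent_dir):
    """Extract the archive under parent_dir (run on the server)."""
    parent_dir = Path(parent_dir)
    parent_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(parent_dir)
    return parent_dir


if __name__ == "__main__":
    # Assumption: the JIT cache lives in ~/.cache/flashinfer.
    source = Path.home() / ".cache" / "flashinfer"
    if source.is_dir():
        print(pack_cache(source, "flashinfer-cache.tar.gz"))
```

On the server, unpack_cache("flashinfer-cache.tar.gz", Path.home() / ".cache") would recreate the directory in the same relative location, again assuming that is where the runtime looks for it.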
There is no mention in the official docs or issues of a flashinfer-cubin package or similar that disables all JIT compilation or provides all prebuilt kernels for all models/GPUs. If you want to avoid installing the CUDA toolkit on production, use the official Docker image or ensure all JIT artifacts are prebuilt and transferred, but this is not a one-command solution.
Would you like step-by-step instructions for the Docker-based approach or for prebuilding/copying the cache?