Build vLLM without installing gcc?

Hi,

I'm building vLLM in a Red Hat ubi8-minimal container, and one of the things I've had to do is microdnf install gcc-c++ in the Dockerfile. Unfortunately, gcc pulls in kernel-headers as a dependency, and vLLM does not seem to start up correctly without gcc. The problem with kernel-headers is that it introduces a lot of vulnerabilities, so I'm wondering: can I get away with not installing gcc at all?

Please see below for the runtime error I get when starting vLLM. Is there a way to configure vLLM so that it never invokes a C compiler?

INFO 06-20 01:58:58 [__init__.py:244] Automatically detected platform cuda.
vllm | INFO 06-20 01:59:01 [api_server.py:1287] vLLM API server version 0.9.2.dev169+gea10dd9d9
vllm | INFO 06-20 01:59:01 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'enable_auto_tool_choice': True, 'tool_call_parser': 'llama3_json', 'model': '/meta/llama3.1-8b', 'max_model_len': 12000, 'gpu_memory_utilization': 0.95}
vllm | INFO 06-20 01:59:09 [config.py:831] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
vllm | INFO 06-20 01:59:09 [config.py:1444] Using max model len 12000
vllm | INFO 06-20 01:59:10 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
vllm | INFO 06-20 01:59:10 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=2048.
vllm | INFO 06-20 01:59:14 [__init__.py:244] Automatically detected platform cuda.
vllm | INFO 06-20 01:59:17 [core.py:459] Waiting for init message from front-end.
vllm | INFO 06-20 01:59:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev169+gea10dd9d9) with config: model='/meta/llama3.1-8b', speculative_config=None, tokenizer='/meta/llama3.1-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/meta/llama3.1-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
vllm | Process EngineCore_0:
vllm | ERROR 06-20 01:59:17 [core.py:519] EngineCore failed to start.
vllm | ERROR 06-20 01:59:17 [core.py:519] Traceback (most recent call last):
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
vllm | ERROR 06-20 01:59:17 [core.py:519] engine_core = EngineCoreProc(*args, **kwargs)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 394, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] super().__init__(vllm_config, executor_class, log_stats,
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 75, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] self.model_executor = executor_class(vllm_config)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] self._init_executor()
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
vllm | ERROR 06-20 01:59:17 [core.py:519] self.collective_rpc("init_worker", args=([kwargs], ))
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
vllm | ERROR 06-20 01:59:17 [core.py:519] answer = run_method(self.driver_worker, method, args, kwargs)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/utils.py", line 2690, in run_method
vllm | ERROR 06-20 01:59:17 [core.py:519] return func(*args, **kwargs)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/worker/worker_base.py", line 558, in init_worker
vllm | ERROR 06-20 01:59:17 [core.py:519] worker_class = resolve_obj_by_qualname(
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/utils.py", line 2258, in resolve_obj_by_qualname
vllm | ERROR 06-20 01:59:17 [core.py:519] module = importlib.import_module(module_name)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/lib64/python3.12/importlib/__init__.py", line 90, in import_module
vllm | ERROR 06-20 01:59:17 [core.py:519] return _bootstrap._gcd_import(name[level:], package, level)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap_external>", line 999, in exec_module
vllm | ERROR 06-20 01:59:17 [core.py:519] File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 29, in <module>
vllm | ERROR 06-20 01:59:17 [core.py:519] from vllm.v1.worker.gpu_model_runner import GPUModelRunner
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
vllm | ERROR 06-20 01:59:17 [core.py:519] from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 25, in <module>
vllm | ERROR 06-20 01:59:17 [core.py:519] from vllm.model_executor.layers.mamba.ops.ssd_combined import (
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
vllm | ERROR 06-20 01:59:17 [core.py:519] from .ssd_bmm import _bmm_chunk_fwd
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
vllm | ERROR 06-20 01:59:17 [core.py:519] @triton.autotune(
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
vllm | ERROR 06-20 01:59:17 [core.py:519] return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] self.do_bench = driver.active.get_benchmarker()
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
vllm | ERROR 06-20 01:59:17 [core.py:519] self._initialize_obj()
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
vllm | ERROR 06-20 01:59:17 [core.py:519] self._obj = self._init_fn()
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
vllm | ERROR 06-20 01:59:17 [core.py:519] return actives[0]()
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] self.utils = CudaUtils() # TODO: make static
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
vllm | ERROR 06-20 01:59:17 [core.py:519] mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
vllm | ERROR 06-20 01:59:17 [core.py:519] so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
vllm | ERROR 06-20 01:59:17 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | ERROR 06-20 01:59:17 [core.py:519] File "/usr/local/lib64/python3.12/site-packages/triton/runtime/build.py", line 18, in _build
vllm | ERROR 06-20 01:59:17 [core.py:519] raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
vllm | ERROR 06-20 01:59:17 [core.py:519] RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
vllm | Traceback (most recent call last):
vllm | File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
vllm | self.run()
vllm | File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
vllm | self._target(*self._args, **self._kwargs)
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 523, in run_engine_core
vllm | raise e
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
vllm | engine_core = EngineCoreProc(*args, **kwargs)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 394, in __init__
vllm | super().__init__(vllm_config, executor_class, log_stats,
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 75, in __init__
vllm | self.model_executor = executor_class(vllm_config)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
vllm | self._init_executor()
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
vllm | self.collective_rpc("init_worker", args=([kwargs], ))
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
vllm | answer = run_method(self.driver_worker, method, args, kwargs)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/utils.py", line 2690, in run_method
vllm | return func(*args, **kwargs)
vllm | ^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/worker/worker_base.py", line 558, in init_worker
vllm | worker_class = resolve_obj_by_qualname(
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/utils.py", line 2258, in resolve_obj_by_qualname
vllm | module = importlib.import_module(module_name)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/lib64/python3.12/importlib/__init__.py", line 90, in import_module
vllm | return _bootstrap._gcd_import(name[level:], package, level)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
vllm | File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
vllm | File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
vllm | File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
vllm | File "<frozen importlib._bootstrap_external>", line 999, in exec_module
vllm | File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 29, in <module>
vllm | from vllm.v1.worker.gpu_model_runner import GPUModelRunner
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
vllm | from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 25, in <module>
vllm | from vllm.model_executor.layers.mamba.ops.ssd_combined import (
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
vllm | from .ssd_bmm import _bmm_chunk_fwd
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
vllm | @triton.autotune(
vllm | ^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
vllm | return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
vllm | self.do_bench = driver.active.get_benchmarker()
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
vllm | self._initialize_obj()
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
vllm | self._obj = self._init_fn()
vllm | ^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
vllm | return actives[0]()
vllm | ^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
vllm | self.utils = CudaUtils() # TODO: make static
vllm | ^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
vllm | mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
vllm | so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/triton/runtime/build.py", line 18, in _build
vllm | raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
vllm | RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
vllm | Traceback (most recent call last):
vllm | File "<frozen runpy>", line 198, in _run_module_as_main
vllm | File "<frozen runpy>", line 88, in _run_code
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1387, in <module>
vllm | uvloop.run(run_server(args))
vllm | File "/usr/local/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
vllm | return __asyncio.run(
vllm | ^^^^^^^^^^^^^^
vllm | File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
vllm | return runner.run(main)
vllm | ^^^^^^^^^^^^^^^^
vllm | File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
vllm | return self._loop.run_until_complete(task)
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm | File "/usr/local/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
vllm | return await main
vllm | ^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
vllm | await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
vllm | async with build_async_engine_client(args, client_config) as engine_client:
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
vllm | return await anext(self.gen)
vllm | ^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
vllm | async with build_async_engine_client_from_engine_args(
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
vllm | return await anext(self.gen)
vllm | ^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
vllm | async_llm = AsyncLLM.from_vllm_config(
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
vllm | return cls(
vllm | ^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
vllm | self.engine_core = EngineCoreClient.make_async_mp_client(
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
vllm | return AsyncMPClient(vllm_config, executor_class, log_stats,
vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 716, in __init__
vllm | super().__init__(
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 422, in __init__
vllm | self._init_engines_direct(vllm_config, local_only,
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
vllm | self._wait_for_engine_startup(handshake_socket, input_address,
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
vllm | wait_for_engine_startup(
vllm | File "/usr/local/lib64/python3.12/site-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
vllm | raise RuntimeError("Engine core initialization failed. "
vllm | RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
vllm exited with code 0

You cannot currently run vLLM (with CUDA) without a C compiler present at runtime: Triton compiles its GPU kernels, and a small C utility module (the driver.c visible in your traceback), just-in-time, and that step shells out to a C compiler. This is a known issue that has been discussed in the vLLM community; the only workaround today is to install a C compiler (e.g., gcc) in your container, which unfortunately may pull in kernel-headers and raise security-scan concerns. As of now there is no official vLLM configuration or flag that removes this requirement for CUDA-based inference. See the discussion and error details in this issue.
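For context on why the error message points you at the CC environment variable: Triton's runtime discovers a compiler roughly like the sketch below (a simplified illustration, not Triton's actual code; the function name is mine). It honors CC first, then searches PATH, and raises only when nothing is found.

```python
import os
import shutil


def find_c_compiler():
    """Mimic Triton-style compiler discovery: honor $CC if set, otherwise
    fall back to common compilers on PATH. Returns a path string or None."""
    cc = os.environ.get("CC")
    if cc:
        return cc
    for candidate in ("cc", "gcc", "clang"):
        path = shutil.which(candidate)
        if path:
            return path
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler is None:
        print("No C compiler found; Triton's JIT would fail at startup.")
    else:
        print(f"Triton should be able to use: {compiler}")
```

Running a check like this at container build time can tell you up front whether the image will fail the way your log shows, without waiting for the engine core to crash.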

Some users report that the --enforce-eager flag avoids the C compiler requirement, but it disables CUDA graph capture and may reduce performance; note also that the failure in your traceback occurs while importing the Triton-based Mamba ops, so it may still trigger Triton's driver compilation regardless of that flag. There is ongoing community work on ahead-of-time (AOT) compilation and on Triton Deja-vu-style kernel caching, but neither is available in mainline vLLM yet. For now, a C compiler is required for production CUDA inference with vLLM. See also this comment and related discussions.
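If you do end up installing a compiler, a minimal setup for a ubi8-minimal image might look like the following sketch (package names, paths, and flags are assumptions about your environment; note the error only asks for a C compiler, so plain gcc rather than gcc-c++ may suffice):

```shell
# Inside the Dockerfile (RUN step): install just a C compiler, then trim the cache.
microdnf install -y gcc && microdnf clean all

# At container start: point Triton at the compiler explicitly via CC,
# and optionally run in eager mode to skip CUDA graph capture.
export CC=/usr/bin/gcc
vllm serve /meta/llama3.1-8b \
  --enforce-eager \
  --max-model-len 12000 \
  --gpu-memory-utilization 0.95
```

Whether gcc alone avoids the kernel-headers dependency depends on your base image's package graph, so check your scanner's report after rebuilding.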

Would you like more detail on the technical reasons or possible workarounds?
