FlashInfer latest version is not working with vLLM

teddythinh · August 18, 2025, 4:48am

I installed the latest FlashInfer version, 0.2.11.post3, but vLLM doesn’t recognize it and falls back with this warning:

WARNING 08-18 04:37:12 [topk_topp_sampler.py:36] FlashInfer version >= 0.2.3 required. Falling back to default sampling implementation.

I also tried installing version 0.2.11 but still got the warning. Finally, I installed 0.2.3, the minimum version vLLM requires, and it works. So I’m not sure whether the warning message is incorrect or I’m missing something. I’m currently using vLLM 0.10.0.

RunLLM · August 18, 2025, 4:48am

This warning is caused by a version comparison bug in vLLM: string-based checks treated “0.2.10” or “0.2.11” as less than “0.2.3”, so vLLM incorrectly falls back to the default sampler even with newer FlashInfer versions. This has been fixed in recent vLLM code by switching to proper version parsing, but vLLM 0.10.0 still has the bug. Upgrading vLLM to a version with the fix will resolve the issue; otherwise, use FlashInfer 0.2.3 as a workaround for vLLM 0.10.0. See issue #22455 and PR #22314.
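To illustrate the kind of comparison bug described here, below is a minimal sketch that contrasts a plain string comparison with a numeric one. It is not the actual vLLM code or the code from the PR: `release_tuple` is a hypothetical helper written for this example, and it only looks at the leading numeric part of the version (so a suffix such as `.post3` is ignored). A real fix would more likely rely on a full version parser such as `packaging.version`.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Hypothetical helper: return the leading numeric release segment
    of a version string as integers, e.g. '0.2.11.post3' -> (0, 2, 11)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


MINIMUM = "0.2.3"

for installed in ("0.2.3", "0.2.10", "0.2.11.post3"):
    # String comparison goes character by character, so '1' < '3'
    # makes "0.2.10" and "0.2.11.post3" look older than "0.2.3".
    naive_ok = installed >= MINIMUM
    # Comparing integer tuples treats 10 and 11 as larger than 3.
    parsed_ok = release_tuple(installed) >= release_tuple(MINIMUM)
    print(f"{installed}: string check={naive_ok}, numeric check={parsed_ok}")
```

With the string check, only 0.2.3 passes, which matches the behaviour reported above; with the numeric check, all three versions pass.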

Would you like more detail on how to patch your current vLLM version or guidance on upgrading?

Sources:

issue #22455

simon-mo · August 18, 2025, 5:21am

This is fixed in the nightly build and in the upcoming v0.10.1.

RunLLM · August 18, 2025, 5:21am

Yes, this bug is fixed in the nightly builds and will be included in vLLM v0.10.1. The fix ensures proper version comparison for FlashInfer, so newer versions like 0.2.10 and 0.2.11.post3 are recognized correctly. See issue #22455 and PR #22314.

Would you like instructions on installing the nightly or waiting for v0.10.1?

Sources:

issue #22455

teddythinh · August 18, 2025, 6:25am

Thank you @simon-mo, I can see the comment in the PR.

RunLLM · August 18, 2025, 6:25am

You’re welcome! If you need more details from the PR or help applying the fix, let me know. The relevant discussion and resolution are in PR #22314.

Would you like a summary of the PR or guidance on updating your environment?

teddythinh · August 20, 2025, 4:45am

Hi @simon-mo, I get the following error after installing the latest vLLM 0.10.1 and the latest FlashInfer 0.2.12.

(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] Traceback (most recent call last):
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/cpp_ext.py", line 199, in run_ninja
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     subprocess.run(
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/subprocess.py", line 571, in run
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     raise CalledProcessError(retcode, process.args,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/home/admin/.cache/flashinfer/80/cached_ops', '-f', '/home/admin/.cache/flashinfer/80/cached_ops/sampling/build.ninja']' returned non-zero exit status 1.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] The above exception was the direct cause of the following exception:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] Traceback (most recent call last):
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 691, in run_engine_core
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 492, in __init__
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 89, in __init__
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 179, in _initialize_kv_caches
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     self.model_executor.determine_available_memory())
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/utils/__init__.py", line 3007, in run_method
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return func(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return func(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     self.model_runner.profile_run()
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2627, in profile_run
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     output = self._dummy_sampler_run(last_hidden_states)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return func(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in _dummy_sampler_run
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     raise e
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2460, in _dummy_sampler_run
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     sampler_output = self.sampler(logits=logits,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return forward_call(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/sampler.py", line 68, in forward
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     sampled = self.sample(logits, sampling_metadata)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/sampler.py", line 135, in sample
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     random_sampled = self.topk_topp_sampler(
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return forward_call(*args, **kwargs)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 109, in forward_cuda
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return flashinfer_sample(logits.contiguous(), k, p, generators)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 295, in flashinfer_sample
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 980, in top_k_top_p_sampling_from_logits
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     masked_logits = top_k_mask_logits(logits, top_k)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 1300, in top_k_mask_logits
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     return get_sampling_module().top_k_mask_logits(
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 47, in get_sampling_module
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     module = gen_sampling_module().build_and_load()
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/core.py", line 147, in build_and_load
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     self.build(verbose, need_lock=False)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/core.py", line 127, in build
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     run_ninja(jit_env.FLASHINFER_JIT_DIR, self.ninja_path, verbose)
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/cpp_ext.py", line 211, in run_ninja
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]     raise RuntimeError(msg) from e
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] RuntimeError: Ninja build failed. Ninja output:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] ninja: Entering directory `/home/admin/.cache/flashinfer/80/cached_ops'
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] [1/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/renorm.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/renorm.cu -o sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] FAILED: [code=1] sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/renorm.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/renorm.cu -o sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] fatal   : Could not open input file /tmp/tmpxft_0000193d_00000000-7_renorm.cpp1.ii
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] [2/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/sampling.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/sampling.cu -o sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] FAILED: [code=1] sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/sampling.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/sampling.cu -o sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] fatal   : Could not open input file /tmp/tmpxft_0000193c_00000000-7_sampling.cpp1.ii
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] [3/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/flashinfer_sampling_ops.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/flashinfer_sampling_ops.cu -o sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] FAILED: [code=1] sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/flashinfer_sampling_ops.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/flashinfer_sampling_ops.cu -o sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]                  from <command-line>:
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]    45 | #include_next <math.h>
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]       |               ^~~~~~~~
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] compilation terminated.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] fatal   : Could not open input file /tmp/tmpxft_0000193e_00000000-7_flashinfer_sampling_ops.cpp1.ii
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700] ninja: build stopped: subcommand failed.
(EngineCore_0 pid=6321) ERROR 08-20 04:27:30 [core.py:700]

teddythinh · August 20, 2025, 4:46am

The error log below continues from my previous comment, because the forum cannot send more than 50000 words in one post.

(EngineCore_0 pid=6321) Process EngineCore_0:
(EngineCore_0 pid=6321) Traceback (most recent call last):
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/cpp_ext.py", line 199, in run_ninja
(EngineCore_0 pid=6321)     subprocess.run(
(EngineCore_0 pid=6321)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/subprocess.py", line 571, in run
(EngineCore_0 pid=6321)     raise CalledProcessError(retcode, process.args,
(EngineCore_0 pid=6321) subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/home/admin/.cache/flashinfer/80/cached_ops', '-f', '/home/admin/.cache/flashinfer/80/cached_ops/sampling/build.ninja']' returned non-zero exit status 1.
(EngineCore_0 pid=6321) 
(EngineCore_0 pid=6321) The above exception was the direct cause of the following exception:
(EngineCore_0 pid=6321) 
(EngineCore_0 pid=6321) Traceback (most recent call last):
(EngineCore_0 pid=6321)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=6321)     self.run()
(EngineCore_0 pid=6321)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=6321)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 704, in run_engine_core
(EngineCore_0 pid=6321)     raise e
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 691, in run_engine_core
(EngineCore_0 pid=6321)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=6321)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 492, in __init__
(EngineCore_0 pid=6321)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 89, in __init__
(EngineCore_0 pid=6321)     self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 179, in _initialize_kv_caches
(EngineCore_0 pid=6321)     self.model_executor.determine_available_memory())
(EngineCore_0 pid=6321)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=6321)     output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=6321)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=6321)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=6321)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/utils/__init__.py", line 3007, in run_method
(EngineCore_0 pid=6321)     return func(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=6321)     return func(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
(EngineCore_0 pid=6321)     self.model_runner.profile_run()
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2627, in profile_run
(EngineCore_0 pid=6321)     output = self._dummy_sampler_run(last_hidden_states)
(EngineCore_0 pid=6321)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=6321)     return func(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in _dummy_sampler_run
(EngineCore_0 pid=6321)     raise e
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2460, in _dummy_sampler_run
(EngineCore_0 pid=6321)     sampler_output = self.sampler(logits=logits,
(EngineCore_0 pid=6321)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_0 pid=6321)     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_0 pid=6321)     return forward_call(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/sampler.py", line 68, in forward
(EngineCore_0 pid=6321)     sampled = self.sample(logits, sampling_metadata)
(EngineCore_0 pid=6321)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/sampler.py", line 135, in sample
(EngineCore_0 pid=6321)     random_sampled = self.topk_topp_sampler(
(EngineCore_0 pid=6321)                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_0 pid=6321)     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_0 pid=6321)     return forward_call(*args, **kwargs)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 109, in forward_cuda
(EngineCore_0 pid=6321)     return flashinfer_sample(logits.contiguous(), k, p, generators)
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 295, in flashinfer_sample
(EngineCore_0 pid=6321)     next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
(EngineCore_0 pid=6321)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 980, in top_k_top_p_sampling_from_logits
(EngineCore_0 pid=6321)     masked_logits = top_k_mask_logits(logits, top_k)
(EngineCore_0 pid=6321)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 1300, in top_k_mask_logits
(EngineCore_0 pid=6321)     return get_sampling_module().top_k_mask_logits(
(EngineCore_0 pid=6321)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/sampling.py", line 47, in get_sampling_module
(EngineCore_0 pid=6321)     module = gen_sampling_module().build_and_load()
(EngineCore_0 pid=6321)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/core.py", line 147, in build_and_load
(EngineCore_0 pid=6321)     self.build(verbose, need_lock=False)
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/core.py", line 127, in build
(EngineCore_0 pid=6321)     run_ninja(jit_env.FLASHINFER_JIT_DIR, self.ninja_path, verbose)
(EngineCore_0 pid=6321)   File "/home/admin/.venv/lib/python3.11/site-packages/flashinfer/jit/cpp_ext.py", line 211, in run_ninja
(EngineCore_0 pid=6321)     raise RuntimeError(msg) from e
(EngineCore_0 pid=6321) RuntimeError: Ninja build failed. Ninja output:
(EngineCore_0 pid=6321) ninja: Entering directory `/home/admin/.cache/flashinfer/80/cached_ops'
(EngineCore_0 pid=6321) [1/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/renorm.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/renorm.cu -o sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) FAILED: [code=1] sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/renorm.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/renorm.cu -o sampling/renorm.cuda.o 
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) fatal   : Could not open input file /tmp/tmpxft_0000193d_00000000-7_renorm.cpp1.ii
(EngineCore_0 pid=6321) [2/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/sampling.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/sampling.cu -o sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) FAILED: [code=1] sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/sampling.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/sampling.cu -o sampling/sampling.cuda.o 
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) fatal   : Could not open input file /tmp/tmpxft_0000193c_00000000-7_sampling.cpp1.ii
(EngineCore_0 pid=6321) [3/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/flashinfer_sampling_ops.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/flashinfer_sampling_ops.cu -o sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) FAILED: [code=1] sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output sampling/flashinfer_sampling_ops.cuda.o.d -DTORCH_EXTENSION_NAME=sampling -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/admin/.pyenv/versions/3.11.9/include/python3.11 -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include -isystem /home/admin/.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 -O3 -std=c++17 --threads=24 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/admin/.venv/lib/python3.11/site-packages/flashinfer/data/csrc/flashinfer_sampling_ops.cu -o sampling/flashinfer_sampling_ops.cuda.o 
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) In file included from /usr/include/crt/math_functions.h:10551,
(EngineCore_0 pid=6321)                  from /usr/include/crt/common_functions.h:303,
(EngineCore_0 pid=6321)                  from /usr/include/cuda_runtime.h:118,
(EngineCore_0 pid=6321)                  from <command-line>:
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321)       |               ^~~~~~~~
(EngineCore_0 pid=6321) compilation terminated.
(EngineCore_0 pid=6321) fatal   : Could not open input file /tmp/tmpxft_0000193e_00000000-7_flashinfer_sampling_ops.cpp1.ii
(EngineCore_0 pid=6321) ninja: build stopped: subcommand failed.
(EngineCore_0 pid=6321) 
[rank0]:[W820 04:27:31.206065088 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6216) Traceback (most recent call last):
(APIServer pid=6216)   File "/home/admin/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6216)     sys.exit(main())
(APIServer pid=6216)              ^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=6216)     args.dispatch_function(args)
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=6216)     uvloop.run(run_server(args))
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
(APIServer pid=6216)     return runner.run(wrapper())
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=6216)     return self._loop.run_until_complete(task)
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=6216)     return await main
(APIServer pid=6216)            ^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1850, in run_server
(APIServer pid=6216)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1870, in run_server_worker
(APIServer pid=6216)     async with build_async_engine_client(
(APIServer pid=6216)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=6216)     return await anext(self.gen)
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client
(APIServer pid=6216)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6216)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=6216)     return await anext(self.gen)
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 220, in build_async_engine_client_from_engine_args
(APIServer pid=6216)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6216)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/utils/__init__.py", line 1557, in inner
(APIServer pid=6216)     return fn(*args, **kwargs)
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 174, in from_vllm_config
(APIServer pid=6216)     return cls(
(APIServer pid=6216)            ^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 120, in __init__
(APIServer pid=6216)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6216)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=6216)     return AsyncMPClient(*client_args)
(APIServer pid=6216)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 767, in __init__
(APIServer pid=6216)     super().__init__(
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 446, in __init__
(APIServer pid=6216)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=6216)   File "/home/admin/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=6216)     next(self.gen)
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 706, in launch_core_engines
(APIServer pid=6216)     wait_for_engine_startup(
(APIServer pid=6216)   File "/home/admin/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 759, in wait_for_engine_startup
(APIServer pid=6216)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=6216) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

RunLLM · August 20, 2025, 4:46am

Your error comes from a missing system header (math.h) that nvcc needs when FlashInfer JIT-compiles its sampling kernels. This is a system-level issue, not a vLLM or FlashInfer bug. You need to install the development headers for your system’s C standard library.

On Ubuntu/Debian, run:

sudo apt-get install build-essential
sudo apt-get install libc6-dev

This will provide math.h and other required headers. After installing, try running vLLM again. For more context, see similar issues and solutions in vllm-project/vllm#22008.
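Before and after installing the packages, it can help to confirm whether the header and the build tools are actually visible. The following is a minimal sketch, not part of vLLM or FlashInfer: the function names `find_header` and `toolchain_report` are hypothetical, and the include directories are an assumption based on a typical Debian/Ubuntu layout.

```python
import shutil
from pathlib import Path

# Assumption: typical Debian/Ubuntu header locations; other distros may differ.
CANDIDATE_INCLUDE_DIRS = [
    Path("/usr/include"),
    Path("/usr/include/x86_64-linux-gnu"),
]


def find_header(name, include_dirs=CANDIDATE_INCLUDE_DIRS):
    """Return the first existing path for a header, or None if it is absent."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def toolchain_report(tools=("gcc", "g++", "nvcc", "ninja")):
    """Map each build tool to its resolved path on PATH (None if missing)."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    print("math.h:", find_header("math.h"))
    for tool, path in toolchain_report().items():
        print(f"{tool}: {path}")
```

If `math.h` is reported as present and the build still fails, the include search order on the nvcc command line may be worth examining. That is a guess, not something confirmed in this thread.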

Would you like more troubleshooting steps or details on JIT/AoT compilation for FlashInfer?

Sources:

vllm-project/vllm#22008

teddythinh · August 20, 2025, 4:47am

Here is my env:

==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.9 (main, Jul  1 2025, 07:11:31) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.11.0-1018-azure-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.0.140
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version        : 570.172.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               24
On-line CPU(s) list:                  0-23
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7V13 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             4890.89
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves user_shstk clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             12 MiB (24 instances)
L3 cache:                             96 MiB (3 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-23
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cudnn-frontend==1.14.0
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pynvml==12.0.0
[pip3] pyzmq==27.0.1
[pip3] torch==2.7.1+cu128
[pip3] torchaudio==2.7.1+cu128
[pip3] torchvision==0.22.1+cu128
[pip3] transformers==4.55.2
[pip3] triton==3.3.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.10.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-23    0               N/A
NIC0    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

RunLLM · August 20, 2025, 4:47am

Your error is caused by a missing system header (math.h) during the JIT compilation of FlashInfer’s CUDA kernels. This is a known issue when building vLLM/FlashInfer on some Linux distributions, especially Ubuntu 24.04, or when certain dev packages are missing. The error does not depend on the vLLM or FlashInfer version; it comes from your system’s C/C++ development environment.
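The Ninja output repeats the same compiler error for every source file, so it can be easier to read once the fatal errors are deduplicated. Here is a small sketch of that idea. The helper `summarize_fatal_errors` is a hypothetical name, and the regular expression assumes GCC-style `file:line:col: fatal error: ...` lines like the ones in the log above.

```python
import re
from collections import Counter

# Matches compiler lines such as:
#   /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
FATAL_RE = re.compile(
    r"(?P<file>\S+):(?P<line>\d+):(?P<col>\d+): fatal error: (?P<msg>.+)$"
)


def summarize_fatal_errors(log_text):
    """Count distinct (source location, message) pairs among fatal errors."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = FATAL_RE.search(raw)
        if match:
            location = f"{match['file']}:{match['line']}"
            counts[(location, match["msg"].strip())] += 1
    return counts


if __name__ == "__main__":
    sample = """\
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
(EngineCore_0 pid=6321)    45 | #include_next <math.h>
(EngineCore_0 pid=6321) /usr/include/c++/11/cmath:45:15: fatal error: math.h: No such file or directory
"""
    for (location, message), n in summarize_fatal_errors(sample).items():
        print(f"{n}x {location}: {message}")
```

Run against the full log, this should collapse the repeated blocks into a single distinct failure, which makes it clear that all three compilation units stop at the same include.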

Solution:
Install the required development libraries. On Ubuntu/Debian, run:

sudo apt-get update
sudo apt-get install build-essential python3-dev

If you are on Ubuntu 24.04, you may also need:

sudo apt-get install libc6-dev

After installing, clear the FlashInfer cache and try again:

rm -rf ~/.cache/flashinfer

Then restart your vLLM server.
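If you prefer to clear the cache from Python, for example inside a setup script, the same step can be sketched as below. The function name `clear_jit_cache` is hypothetical, and the default path is an assumption taken from the cache directory shown in the log above (`/home/admin/.cache/flashinfer`); a customised setup may store the cache elsewhere.

```python
import shutil
from pathlib import Path

# Assumption: default cache location, matching ~/.cache/flashinfer from the log.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "flashinfer"


def clear_jit_cache(cache_dir=DEFAULT_CACHE_DIR):
    """Delete the cache directory if present; return True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


if __name__ == "__main__":
    # Report only; call clear_jit_cache() explicitly to actually delete.
    print("cache dir:", DEFAULT_CACHE_DIR, "exists:", DEFAULT_CACHE_DIR.is_dir())
```

Removing the directory only forces the kernels to be rebuilt on the next start; it does not change the installed packages.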

For more details, see vllm-project/vllm#22008 and related comments.

Would you like more troubleshooting steps or help for a different OS?

Sources:

vllm-project/vllm#22008

RunLLM · August 20, 2025, 4:47am

Your environment is missing system development headers required for CUDA JIT compilation (e.g., math.h), which causes FlashInfer and vLLM to fail on Ubuntu 24.04. To fix this, install the following packages:

sudo apt-get update
sudo apt-get install build-essential python3-dev libc6-dev

This will provide the necessary C/C++ headers for nvcc. For more context, see vllm-project/vllm#22008.

Would you like more details on troubleshooting or JIT/AoT compilation?

Sources:

vllm-project/vllm#22008
