Need help compiling and running on Jetson Thor

Hello,

I’ve been trying for two weeks now to compile and run vLLM on Jetson Thor, and I just can’t get it to work.

As a starting point, I have a working vllm conda env on both machines:

(vllm) root@jetson-orin:/usr/src/vllm# /root/cuda_test.sh 
PyTorch version: 2.8.0
CUDA available: True
GPU name: Orin
Tensor sum: 50000504.0
(vllm) root@jetson-thor:/usr/src/vllm/build# /root/cuda_test.sh 
PyTorch version: 2.9.0
CUDA available: True
GPU name: NVIDIA Thor
Tensor sum: 49998336.0
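
(For reference, cuda_test.sh isn’t shown here; it’s just a small PyTorch/CUDA sanity check, roughly along these lines. This is a reconstruction, not the exact script:)

#!/usr/bin/env bash
# Rough sketch of cuda_test.sh: check that PyTorch sees the GPU and can run a kernel.
python - <<'EOF'
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))
x = torch.rand(100_000_000, device="cuda")
print("Tensor sum:", x.sum().item())
EOF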

On Orin:

export CUDA_HOME=/usr/local/cuda
export CUSPARSELT_DIR=/usr/src/libcusparse_lt-linux-aarch64-0.8.1.1_cuda13-archive/
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${CUSPARSELT_DIR}/lib:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/go/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:$CUDA_HOME/bin
export USE_CUDA=1
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.7;8.9;10.0;11.0"
export CMAKE_CUDA_ARCHITECTURES=native
export CUTLASS_NVCC_ARCHS="110"
export FORCE_CUDA=1  #this env var is what causes compilation of nms.
export MAX_JOBS=10
export USE_CUDNN=1
export USE_CUSPARSELT=1
export USE_MARLIN=1
export USE_GPTQ=1
export USE_AWQ=1
export CCACHE_NOHASHDIR="true" 
export VLLM_CUTLASS_SRC_DIR=/usr/src/cutlass-4.2.1/

CCACHE_NOHASHDIR="true" uv pip install -v -e . --no-build-isolation --no-cache-dir
(vllm) root@jetson-orin:/usr/src/vllm# pip list | grep -e torch -e triton
WARNING: Ignoring invalid distribution -orch (/root/miniconda3/envs/vllm/lib/python3.10/site-packages)
WARNING: Ignoring invalid distribution -ympy (/root/miniconda3/envs/vllm/lib/python3.10/site-packages)
torch                             2.8.0
torchaudio                        2.8.0
torchvision                       0.23.0                                     /root/miniconda3/envs/vllm/lib/python3.10/site-packages
triton                            3.4.0

(Worker pid=31078) INFO 10-19 07:33:46 [multiproc_executor.py:589] Parent process exited, terminating worker
(APIServer pid=30990) Traceback (most recent call last):
(APIServer pid=30990)   File "/root/miniconda3/envs/vllm/bin/vllm", line 7, in <module>
(APIServer pid=30990)     sys.exit(main())
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=30990)     args.dispatch_function(args)
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/cli/serve.py", line 62, in cmd
(APIServer pid=30990)     uvloop.run(run_server(args))
(APIServer pid=30990)   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
(APIServer pid=30990)     return loop.run_until_complete(wrapper())
(APIServer pid=30990)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=30990)   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=30990)     return await main
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1920, in run_server
(APIServer pid=30990)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1943, in run_server_worker
(APIServer pid=30990)     await init_app_state(engine_client, app.state, args)
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1696, in init_app_state
(APIServer pid=30990)     OpenAIServingResponses(
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/openai/serving_responses.py", line 178, in __init__
(APIServer pid=30990)     get_stop_tokens_for_assistant_actions()
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/harmony_utils.py", line 444, in get_stop_tokens_for_assistant_actions
(APIServer pid=30990)     return get_encoding().stop_tokens_for_assistant_actions()
(APIServer pid=30990)   File "/usr/src/vllm/vllm/entrypoints/harmony_utils.py", line 75, in get_encoding
(APIServer pid=30990)     _harmony_encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
(APIServer pid=30990)   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/openai_harmony/__init__.py", line 689, in load_harmony_encoding
(APIServer pid=30990)     inner: _PyHarmonyEncoding = _load_harmony_encoding(name)
(APIServer pid=30990) openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab file

I was able to fix the harmony error with a custom build from source, and Orin is now up and running.

On Thor:

(vllm) root@jetson-thor:/usr/src/vllm/build# pip list | grep -e torch -e triton
open_clip_torch                   3.2.0
pytorch-lightning                 2.5.5
pytorch-msssim                    1.0.0
pytorch3d                         0.7.8
slangtorch                        1.3.13
torch                             2.9.0
torch_scatter                     2.1.2
torch-tb-profiler                 0.4.3
torchaudio                        2.9.0
torchcodec                        0.7.0
torchdiffeq                       0.2.5
torchmetrics                      1.8.2
torchsde                          0.2.6
torchtyping                       0.1.5
torchvision                       0.24.0
triton                            3.5.0
(Worker pid=522135)
(Worker pid=522135) INFO 10-19 09:33:23 [default_loader.py:267] Loading weights took 182.40 seconds
(Worker pid=522135) WARNING 10-19 09:33:23 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     self.worker.load_model()
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/model_loader/base_loader.py", line 51, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     process_weights_after_loading(model, model_config, target_device)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     quant_method.process_weights_after_loading(module)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/layers/quantization/mxfp4.py", line 304, in process_weights_after_loading
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     prepare_moe_fp4_layer_for_marlin(layer)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 228, in prepare_moe_fp4_layer_for_marlin
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]     qweight = weight[i].view(torch.int32).T.contiguous()
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]
(Worker pid=522135) INFO 10-19 09:33:23 [multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W1019 09:33:24.303605950 ProcessGroupNCCL.cpp:1541] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https

I’ve tried all sorts of variations of the build and just can’t make it work. Very frustrating.

For example, when I don’t set TORCH_CUDA_ARCH_LIST at all, I get

Not building Marlin MOE kernels as no compatible archs found in CUDA target architectures

and the run fails on gpt-oss. I also see that Flash-Attention won’t build properly when the CUDA arch is not detected; otherwise it sets 8.0/8.0+PTX or 9.0/9.0+PTX.
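
(A quick check of what the device itself reports vs. which archs the installed PyTorch wheel was built with; this is generic PyTorch, nothing vLLM-specific:)

# Compute capability reported by the GPU, and the arch list baked into the torch wheel
python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"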

DEBUG -- CUDA target architectures: 
DEBUG -- CUDA supported target architectures: 
DEBUG -- FA2_ARCHS: 
DEBUG -- FA3_ARCHS: 
DEBUG -- vllm-flash-attn is available at /usr/src/vllm/.deps/vllm-flash-attn-src

Then I get:

(Worker pid=536101) INFO 10-19 10:18:03 [default_loader.py:267] Loading weights took 183.15 seconds
(Worker pid=536101) WARNING 10-19 10:18:03 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     self.worker.load_model()
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/model_loader/base_loader.py", line 51, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     process_weights_after_loading(model, model_config, target_device)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     quant_method.process_weights_after_loading(module)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/layers/quantization/mxfp4.py", line 304, in process_weights_after_loading
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     prepare_moe_fp4_layer_for_marlin(layer)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 230, in prepare_moe_fp4_layer_for_marlin
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     marlin_qweight = ops.gptq_marlin_repack(b_q_weight=qweight,
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/usr/src/vllm/vllm/_custom_ops.py", line 956, in gptq_marlin_repack
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n,
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]   File "/opt/local/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]     return self._op(*args, **kwargs)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] NotImplementedError: Could not run '_C::gptq_marlin_repack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::gptq_marlin_repack' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMTIA, AutogradMAIA, AutogradMeta, Tracer, AutocastCPU, AutocastMTIA, AutocastMAIA, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

or

DEBUG -- CUDA target architectures: 8.9
DEBUG -- CUDA supported target architectures: 8.9
DEBUG -- FA2_ARCHS: 8.0+PTX
DEBUG -- FA3_ARCHS: 8.0
DEBUG -- vllm-flash-attn is available at /usr/src/vllm/.deps/vllm-flash-attn-src

with this wonderful error:

(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597]     qweight = weight[i].view(torch.int32).T.contiguous()
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597]
(Worker pid=544462) INFO 10-19 12:18:42 [multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W1019 12:18:43.420660554 ProcessGroupNCCL.cpp:1541] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Your error “CUDA error: no kernel image is available for execution on the device” means the vLLM CUDA kernels were not compiled for the NVIDIA Thor (sm_110) architecture, so the necessary device code is missing. This is a common issue for new architectures like Thor, as prebuilt vLLM wheels and many build scripts do not yet include sm_110 or sm_120 by default. The same applies to Marlin/FP4 and FlashAttention kernels, which also need explicit arch support in the build.
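
One way to confirm this on the machine (a sketch; the path and extension names are assumptions based on the build tree in this thread, adjust as needed) is to list the SASS architectures that actually got embedded in the compiled vLLM extensions:

cd /usr/src/vllm
for so in vllm/_C*.so vllm/_moe_C*.so; do
  [ -e "$so" ] || continue
  echo "== $so"
  # cuobjdump lists the embedded device binaries together with their sm_ targets
  cuobjdump --list-elf "$so" | grep -Eo 'sm_[0-9]+a?' | sort -u
done

If the arch your Thor reports is not in that list, the “no kernel image” error follows directly.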

To fix this, you must:

  1. Set the correct CUDA arch flags: For Thor, set TORCH_CUDA_ARCH_LIST="11.0;11.1;12.0" (or at least "11.0;11.1") before building vLLM, and ensure your CUDA, PyTorch, and all dependencies are built for CUDA 13+ (spelled out in the sketch after this list).
  2. Build vLLM from source: Prebuilt wheels do not support Thor. Use uv pip install -e . or pip install -e . after setting the arch flags and environment variables. If you see “Not building Marlin MOE kernels as no compatible archs found”, your arch flags are still wrong.
  3. Check for missing ops: The error “NotImplementedError: Could not run ‘_C::gptq_marlin_repack’ with arguments from the ‘CUDA’ backend” means the custom CUDA op was not built for your device. This is also due to missing arch flags or a failed build.
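
Spelled out, the suggested sequence on Thor would look roughly like this (a sketch only: the arch values are the ones proposed in point 1 and are not verified for Thor; the rest is a trimmed-down version of the Orin environment earlier in the thread):

# On Thor, inside the vLLM source tree (mirrors the Orin environment above)
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export TORCH_CUDA_ARCH_LIST="11.0;11.1;12.0"   # per point 1; unverified on Thor
export MAX_JOBS=10
uv pip install -v -e . --no-build-isolation --no-cache-dir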

For more details and troubleshooting, see the official vLLM installation docs and the related GitHub issues tracking Thor and Blackwell support. Would you like a step-by-step build command sequence for Jetson Thor?


@RunLLM

Regarding point 1: using TORCH_CUDA_ARCH_LIST="11.0;11.1;12.0" gives

DEBUG -- CUDA target architectures: 12.0
DEBUG -- CUDA supported target architectures: 12.0
DEBUG -- FA2_ARCHS: 8.0+PTX
DEBUG -- FA3_ARCHS: 
DEBUG -- vllm-flash-attn is available at /usr/src/vllm/.deps/vllm-flash-attn-src

so FA3 is not activated! And I still get the same torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device.

Using TORCH_CUDA_ARCH_LIST="9.0;11.0;11.1;12.0" does activate FA3, but then the compiler segfaults (see the NVIDIA forum post):

root@jetson-thor:/usr/src/vllm/build/temp.linux-aarch64-cpython-312# /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_CLUSTER -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_PYBIND -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_vllm_fa3_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_vllm_fa3_C_EXPORTS -I/usr/src/vllm/.deps/vllm-flash-attn-src/csrc -I/usr/src/vllm/.deps/vllm-flash-attn-src/hopper -I/usr/src/vllm/.deps/vllm-flash-attn-src/csrc/common -I/usr/src/vllm/.deps/vllm-flash-attn-src/csrc/cutlass/include -isystem /opt/local/miniconda3/envs/vllm/include/python3.12 -isystem /opt/local/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/include -isystem /opt/local/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -g -DNDEBUG -std=c++17 -Xcompiler=-fPIC --expt-relaxed-constexpr -DENABLE_FP8 --threads=1 --expt-extended-lambda --use_fast_math -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -gencode arch=compute_90a,code=sm_90a -MD -MT vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdim128_bf16_softcap_packgqa_sm90.cu.o -MF vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdim128_bf16_softcap_packgqa_sm90.cu.o.d -x cu -c /usr/src/vllm/.deps/vllm-flash-attn-src/hopper/instantiations/flash_fwd_hdim128_bf16_softcap_packgqa_sm90.cu -o vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdim128_bf16_softcap_packgqa_sm90.cu.o
Segmentation fault (core dumped)

So I tried the first suggested link, the GitHub issue. The recommendation there is:

export TORCH_CUDA_ARCH_LIST=11.0a

With that, I get the Marlin problem again:

(Worker pid=556289) ERROR 10-19 13:26:43 [multiproc_executor.py:597] NotImplementedError: Could not run '_C::gptq_marlin_repack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::gptq_marlin_repack' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMTIA, AutogradMAIA, AutogradMeta, Tracer, AutocastCPU, AutocastMTIA, AutocastMAIA, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

You are running into two separate but related issues:

  1. FA3 (FlashAttention 3) not activating:
    When you set TORCH_CUDA_ARCH_LIST="11.0;11.1;12.0", vLLM only detects 12.0 as a supported arch, and FA3 is not built. This is because FA3 kernels are only built for certain architectures (e.g., Hopper/Blackwell, sm_90a, sm_100a, sm_120a), and Thor (sm_110) is not yet officially supported in upstream FlashAttention or vLLM.
    When you add “9.0” to the arch list, the build tries to compile Hopper kernels, but the CUDA compiler segfaults, which is a known issue. This is likely due to missing or incompatible support for sm_90a (Hopper) kernels on Thor in the current CUDA/FlashAttention toolchain.

  2. Marlin/FP4 NotImplementedError:
    When you try to use only “11.0a” (or similar), you get a NotImplementedError for _C::gptq_marlin_repack. This means the Marlin CUDA kernel was not built for your device, so the operator is missing for the CUDA backend. This is because the build system does not yet recognize sm_110 (Thor) as a valid target for Marlin/FP4 kernels.

Summary:

  • Thor (sm_110) is not yet fully supported by vLLM, FlashAttention, or Marlin.
  • Adding “9.0” or “9.0a” to the arch list will not help, and may cause compiler segfaults.
  • There is no current workaround to get FA3 or Marlin working on Thor, as the necessary CUDA kernels are not being built for sm_110, and the toolchain is not ready for this architecture.

Recommendation:
You will need to wait for upstream support for Thor (sm_110) in vLLM, FlashAttention, and Marlin; track the relevant upstream issues for updates. If you want to experiment, you could try patching the build system to treat sm_110 like Hopper/Blackwell, but this is not guaranteed to work and may require deep CUDA/CMake changes; a possible starting point is sketched below.
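
If you do attempt that, a neutral starting point (just a search; no Thor-specific variable names are assumed here) is to grep the build files for the exact message seen earlier in this thread and for the arch lists it is gated on:

cd /usr/src/vllm
# Where is "Not building Marlin MOE kernels ..." emitted, and which arch list gates it?
grep -rn --include='CMakeLists.txt' --include='*.cmake' "Not building Marlin MOE kernels" .
grep -rnE --include='CMakeLists.txt' --include='*.cmake' "CUDA_SUPPORTED_ARCHS|MARLIN" . | head -40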

Would you like more detail on how to attempt a manual patch, or do you want to be notified when official support lands?


Update: vLLM is now installable on NVIDIA Thor with these instructions:

https://forums.developer.nvidia.com/t/run-vllm-in-thor-from-vllm-repository/348804

or by using NVIDIA’s vLLM Docker container:

https://forums.developer.nvidia.com/t/announcing-new-vllm-container-3-5x-increase-in-gen-ai-performance-in-just-5-weeks-of-jetson-agx-thor-launch/346634
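
For anyone who just wants it running, the container route is the simplest. Roughly (the image tag, mounts and model path below are placeholders, not verified; take the exact command from the announcement linked above):

# Placeholders throughout: substitute the exact image/tag from NVIDIA's announcement.
sudo docker run --runtime nvidia --rm -it \
  --network host \
  -v ~/models:/models \
  nvcr.io/nvidia/vllm:<tag-from-announcement> \
  vllm serve /models/<your-model>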