Hello,
I’ve been trying for two weeks now to compile and run vLLM on Jetson Thor, and I just can’t manage to do it.
I have a working vllm conda env on both machines; a basic CUDA check passes on each:
(vllm) root@jetson-orin:/usr/src/vllm# /root/cuda_test.sh
PyTorch version: 2.8.0
CUDA available: True
GPU name: Orin
Tensor sum: 50000504.0
(vllm) root@jetson-thor:/usr/src/vllm/build# /root/cuda_test.sh
PyTorch version: 2.9.0
CUDA available: True
GPU name: NVIDIA Thor
Tensor sum: 49998336.0
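(For reference, cuda_test.sh is just a small PyTorch sanity check, roughly along the lines below; the exact tensor size is illustrative.)

#!/usr/bin/env bash
# Minimal CUDA smoke test: print torch/GPU info and run one real kernel on the device.
python3 - <<'EOF'
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))
x = torch.rand(10_000_000, device="cuda")      # allocate on the GPU
print("Tensor sum:", (x * 10).sum().item())    # forces an actual kernel launch
EOF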
On Orin:
export CUDA_HOME=/usr/local/cuda
export CUSPARSELT_DIR=/usr/src/libcusparse_lt-linux-aarch64-0.8.1.1_cuda13-archive/
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${CUSPARSELT_DIR}/lib:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/go/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:$CUDA_HOME/bin
export USE_CUDA=1
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.7;8.9;10.0;11.0"
export CMAKE_CUDA_ARCHITECTURES=native
export CUTLASS_NVCC_ARCHS="110"
export FORCE_CUDA=1 #this env var is what causes compilation of nms.
export MAX_JOBS=10
export USE_CUDNN=1
export USE_CUSPARSELT=1
export USE_MARLIN=1
export USE_GPTQ=1
export USE_AWQ=1
export CCACHE_NOHASHDIR="true"
export VLLM_CUTLASS_SRC_DIR=/usr/src/cutlass-4.2.1/
CCACHE_NOHASHDIR="true" uv pip install -v -e . --no-build-isolation --no-cache-dir
(vllm) root@jetson-orin:/usr/src/vllm# pip list | grep -e torch -e triton
WARNING: Ignoring invalid distribution -orch (/root/miniconda3/envs/vllm/lib/python3.10/site-packages)
WARNING: Ignoring invalid distribution -ympy (/root/miniconda3/envs/vllm/lib/python3.10/site-packages)
torch 2.8.0
torchaudio 2.8.0
torchvision 0.23.0 /root/miniconda3/envs/vllm/lib/python3.10/site-packages
triton 3.4.0
(Worker pid=31078) INFO 10-19 07:33:46 [multiproc_executor.py:589] Parent process exited, terminating worker
(APIServer pid=30990) Traceback (most recent call last):
(APIServer pid=30990) File "/root/miniconda3/envs/vllm/bin/vllm", line 7, in <module>
(APIServer pid=30990) sys.exit(main())
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=30990) args.dispatch_function(args)
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/cli/serve.py", line 62, in cmd
(APIServer pid=30990) uvloop.run(run_server(args))
(APIServer pid=30990) File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
(APIServer pid=30990) return loop.run_until_complete(wrapper())
(APIServer pid=30990) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=30990) File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=30990) return await main
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1920, in run_server
(APIServer pid=30990) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1943, in run_server_worker
(APIServer pid=30990) await init_app_state(engine_client, app.state, args)
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/openai/api_server.py", line 1696, in init_app_state
(APIServer pid=30990) OpenAIServingResponses(
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/openai/serving_responses.py", line 178, in __init__
(APIServer pid=30990) get_stop_tokens_for_assistant_actions()
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/harmony_utils.py", line 444, in get_stop_tokens_for_assistant_actions
(APIServer pid=30990) return get_encoding().stop_tokens_for_assistant_actions()
(APIServer pid=30990) File "/usr/src/vllm/vllm/entrypoints/harmony_utils.py", line 75, in get_encoding
(APIServer pid=30990) _harmony_encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
(APIServer pid=30990) File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/openai_harmony/__init__.py", line 689, in load_harmony_encoding
(APIServer pid=30990) inner: _PyHarmonyEncoding = _load_harmony_encoding(name)
(APIServer pid=30990) openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab file
I was able to fix harmony with a custom compile from source, and the Orin box is now up and running.
On Thor:
(vllm) root@jetson-thor:/usr/src/vllm/build# pip list | grep -e torch -e triton
open_clip_torch 3.2.0
pytorch-lightning 2.5.5
pytorch-msssim 1.0.0
pytorch3d 0.7.8
slangtorch 1.3.13
torch 2.9.0
torch_scatter 2.1.2
torch-tb-profiler 0.4.3
torchaudio 2.9.0
torchcodec 0.7.0
torchdiffeq 0.2.5
torchmetrics 1.8.2
torchsde 0.2.6
torchtyping 0.1.5
torchvision 0.24.0
triton 3.5.0
(Worker pid=522135)
(Worker pid=522135) INFO 10-19 09:33:23 [default_loader.py:267] Loading weights took 182.40 seconds
(Worker pid=522135) WARNING 10-19 09:33:23 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] worker = WorkerProc(*args, **kwargs)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] self.worker.load_model()
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] self.model = model_loader.load_model(
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/model_loader/base_loader.py", line 51, in load_model
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] process_weights_after_loading(model, model_config, target_device)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] quant_method.process_weights_after_loading(module)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/layers/quantization/mxfp4.py", line 304, in process_weights_after_loading
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] prepare_moe_fp4_layer_for_marlin(layer)
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 228, in prepare_moe_fp4_layer_for_marlin
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] qweight = weight[i].view(torch.int32).T.contiguous()
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=522135) ERROR 10-19 09:33:23 [multiproc_executor.py:597]
(Worker pid=522135) INFO 10-19 09:33:23 [multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W1019 09:33:24.303605950 ProcessGroupNCCL.cpp:1541] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https
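To sanity-check what a given build actually contains for Thor, I’ve been inspecting it roughly like this (a sketch; the _C.abi3.so path comes from my editable install and may differ):

python3 - <<'EOF'
import torch
# What does PyTorch report for the Thor GPU, and which archs was torch itself built with?
print("device capability:", torch.cuda.get_device_capability(0))
print("torch arch list:", torch.cuda.get_arch_list())
EOF
# Which SM architectures ended up embedded in the freshly built vLLM extension?
cuobjdump --list-elf /usr/src/vllm/vllm/_C.abi3.so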
I’ve tried all sorts of build variations and just can’t make it work. Very frustrating.
When I don’t set TORCH_CUDA_ARCH_LIST I get
Not building Marlin MOE kernels as no compatible archs found in CUDA target architectures
and the run fails on gpt-oss. I also see that Flash-Attention won’t be built properly when the CUDA arch is not detected; when it is detected, it only sets 8.0/8.0+PTX or 9.0/9.0+PTX.
DEBUG -- CUDA target architectures:
DEBUG -- CUDA supported target architectures:
DEBUG -- FA2_ARCHS:
DEBUG -- FA3_ARCHS:
DEBUG -- vllm-flash-attn is available at /usr/src/vllm/.deps/vllm-flash-attn-src
Then I get:
(Worker pid=536101) INFO 10-19 10:18:03 [default_loader.py:267] Loading weights took 183.15 seconds
(Worker pid=536101) WARNING 10-19 10:18:03 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] worker = WorkerProc(*args, **kwargs)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] self.worker.load_model()
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] self.model = model_loader.load_model(
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/model_loader/base_loader.py", line 51, in load_model
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] process_weights_after_loading(model, model_config, target_device)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] quant_method.process_weights_after_loading(module)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/layers/quantization/mxfp4.py", line 304, in process_weights_after_loading
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] prepare_moe_fp4_layer_for_marlin(layer)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 230, in prepare_moe_fp4_layer_for_marlin
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] marlin_qweight = ops.gptq_marlin_repack(b_q_weight=qweight,
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/usr/src/vllm/vllm/_custom_ops.py", line 956, in gptq_marlin_repack
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n,
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] File "/opt/local/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] return self._op(*args, **kwargs)
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=536101) ERROR 10-19 10:18:03 [multiproc_executor.py:597] NotImplementedError: Could not run '_C::gptq_marlin_repack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::gptq_marlin_repack' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMTIA, AutogradMAIA, AutogradMeta, Tracer, AutocastCPU, AutocastMTIA, AutocastMAIA, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
or
DEBUG -- CUDA target architectures: 8.9
DEBUG -- CUDA supported target architectures: 8.9
DEBUG -- FA2_ARCHS: 8.0+PTX
DEBUG -- FA3_ARCHS: 8.0
DEBUG -- vllm-flash-attn is available at /usr/src/vllm/.deps/vllm-flash-attn-src
with this wonderful error:
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] qweight = weight[i].view(torch.int32).T.contiguous()
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=544462) ERROR 10-19 12:18:42 [multiproc_executor.py:597]
(Worker pid=544462) INFO 10-19 12:18:42 [multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W1019 12:18:43.420660554 ProcessGroupNCCL.cpp:1541] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
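For completeness, the Thor-side build environment I keep coming back to looks roughly like this; it mirrors the Orin block above, except that the arch flags target only Thor, where 11.0 / sm_110 is my best guess for the compute capability under CUDA 13:

export CUDA_HOME=/usr/local/cuda
export CUSPARSELT_DIR=/usr/src/libcusparse_lt-linux-aarch64-0.8.1.1_cuda13-archive/
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${CUSPARSELT_DIR}/lib:$LD_LIBRARY_PATH
export PATH=$PATH:$CUDA_HOME/bin
export USE_CUDA=1
export FORCE_CUDA=1
# Target Thor only, so the Marlin / Flash-Attention kernels are actually built for this device.
export TORCH_CUDA_ARCH_LIST="11.0"
export CMAKE_CUDA_ARCHITECTURES=110
export CUTLASS_NVCC_ARCHS="110"
export VLLM_CUTLASS_SRC_DIR=/usr/src/cutlass-4.2.1/
export MAX_JOBS=10
# plus the same USE_CUDNN / USE_CUSPARSELT / USE_MARLIN / USE_GPTQ / USE_AWQ switches as on Orin
uv pip install -v -e . --no-build-isolation --no-cache-dir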