Jetson Orin: CUDA error: no kernel image is available for execution on the device

I built vLLM from source on a Jetson Orin with JetPack 6.0 and CUDA 12.2, and got this error when running vLLM:

ERROR 03-29 14:25:00 [core.py:343] out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
ERROR 03-29 14:25:00 [core.py:343] File "/mnt/huynq/archiconda3/envs/test1/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
ERROR 03-29 14:25:00 [core.py:343] return self._op(*args, **(kwargs or {}))
ERROR 03-29 14:25:00 [core.py:343] RuntimeError: CUDA error: no kernel image is available for execution on the device
ERROR 03-29 14:25:00 [core.py:343] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 03-29 14:25:00 [core.py:343]
ERROR 03-29 14:25:00 [core.py:343]
CRITICAL 03-29 14:25:00 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

I thought this was a PyTorch issue, but when I tested with this sample code:
"""
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))
x = torch.rand(3, 3).cuda()
print(x)
"""
It turned out to work fine, with this output:
"""
True
0
Orin
tensor([[0.3726, 0.3407, 0.0371],
        [0.2632, 0.6325, 0.0618],
        [0.0937, 0.5769, 0.7482]], device='cuda:0')
"""
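
The small test above only goes through PyTorch itself. It never touches the separately compiled `_vllm_fa2_C` extension that fails in the traceback, so it may not be enough to rule the build in or out. One further check that might help is comparing the device's compute capability (8.7 on Orin) with the architectures a binary was built for. The sketch below is only an illustration: `arch_covers_device` is a hypothetical helper, not part of PyTorch or vLLM, and it applies the general CUDA rule that an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for equal or newer devices. `torch.cuda.get_arch_list()` reports only what PyTorch was built for. The architecture list used for the vLLM extension would have to be read from the build configuration, which is not shown here.

```python
import re

# Matches entries such as "sm_87" or "compute_90"; the last digit is the minor version.
_ARCH_RE = re.compile(r"(sm|compute)_(\d+)(\d)")


def arch_covers_device(capability, arch_list):
    """Hypothetical helper: True if any entry in arch_list can run on the device.

    capability is a (major, minor) tuple, e.g. (8, 7) for Orin.
    arch_list is a list of strings such as ["sm_80", "compute_90"].
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        m = _ARCH_RE.fullmatch(arch)
        if m is None:
            # Skip entries this sketch does not understand (e.g. suffixed variants).
            continue
        kind, major, minor = m.group(1), int(m.group(2)), int(m.group(3))
        if kind == "sm":
            # Binary code: same major version, minor not newer than the device.
            if major == dev_major and minor <= dev_minor:
                return True
        elif (major, minor) <= (dev_major, dev_minor):
            # PTX: can be JIT-compiled for an equal or newer device.
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("PyTorch arch list:", archs)
        print("covered by PyTorch build:", arch_covers_device(cap, archs))
```

If PyTorch's list covers the device but the error persists, that would point toward the extension having been compiled for a different set of architectures, though this check alone cannot confirm it.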