I’m testing Qwen3-0.6B variants (FP16 / FP8 / GPTQ-Int8 / GPTQ-Int4) on a Jetson AGX Orin (JetPack 6.2) using vLLM, and seeing some unexpected behavior.
Observed behavior:
jtop shows GPU VRAM usage at ~25GB, and the process is clearly using the GPU.
However, generation speed stays around 23–26 tok/s across all models, with no speedup from FP8 or GPTQ quantization.
Even more surprising: the vLLM GPU version is slower than the MNN CPU-only version, which runs at ~42 tok/s.
Questions:
Why isn’t FP8 / GPTQ giving the expected speed boost on Orin with vLLM?
Is this a Jetson-specific limitation with vLLM, or are there key optimizations I’m missing?
Are there known ways to improve vLLM performance for small models like 0.6B on Orin?
Context:
vLLM built from source for JetPack 6.2 / CUDA 12.6.
Running with enforce_eager=True; the logs show that Inductor compilation was disabled.
Quantization is configured as per the docs (quantization=fp8 for FP8, gptq_marlin for GPTQ).
Any insights or tips would be really appreciated! Thanks.
FP8 and GPTQ quantization in vLLM do not always provide a speedup on Jetson Orin because vLLM’s quantized kernels (especially for FP8 and GPTQ) are primarily tuned for server-class GPUs (data-center Ampere, Ada, and Hopper parts), not embedded devices like Orin. Orin’s GPU is itself Ampere-based, and Ampere predates native FP8 tensor-core support (introduced with Ada and Hopper), so an FP8 checkpoint cannot use FP8 compute there and at best reduces the memory traffic for weights. On Jetson, kernel support and hardware acceleration for these quantization formats are limited, so you may see little or no throughput gain, and sometimes even slower performance than FP16 or CPU inference. This is a known limitation and not specific to your setup; see discussion in this issue.
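A rough way to check whether quantization could help at all is a memory-bandwidth ceiling for single-stream decoding: each generated token has to read every weight once, so tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below is only a back-of-envelope estimate. The parameter count (a nominal 0.6e9) and the 204.8 GB/s bandwidth figure (the AGX Orin spec-sheet peak) are assumptions, and KV-cache reads, embeddings, and quantization scales are ignored.

```python
# Back-of-envelope decode ceiling: every generated token reads all weights once.
# All constants are assumptions for illustration, not measurements.

PARAMS = 0.6e9               # nominal parameter count of a 0.6B model (assumed)
MEM_BANDWIDTH_GBPS = 204.8   # assumed peak DRAM bandwidth of AGX Orin, in GB/s

# Approximate storage cost per weight for each variant (scales/zero-points ignored).
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "FP8": 1.0,
    "GPTQ-Int8": 1.0,
    "GPTQ-Int4": 0.5,
}


def decode_ceiling_tok_s(params, bytes_per_weight, bandwidth_gbps):
    """Upper bound on tokens/s if decoding were purely weight-bandwidth-bound."""
    weight_bytes = params * bytes_per_weight
    return bandwidth_gbps * 1e9 / weight_bytes


def ceilings():
    """Return the bandwidth ceiling for every variant in BYTES_PER_WEIGHT."""
    return {
        name: decode_ceiling_tok_s(PARAMS, b, MEM_BANDWIDTH_GBPS)
        for name, b in BYTES_PER_WEIGHT.items()
    }


if __name__ == "__main__":
    for name, tok_s in ceilings().items():
        print(f"{name:10s} ceiling ~ {tok_s:6.0f} tok/s")
```

Under these assumptions the FP16 ceiling is roughly 170 tok/s, several times the observed 23–26 tok/s. That gap suggests the run is probably not limited by weight bandwidth, which would explain why shrinking the weights with FP8 or GPTQ changes little.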
Currently, there are no Jetson-specific optimizations in vLLM for small models or for maximizing quantization speedup. For small models like 0.6B, the overhead of vLLM’s batching and scheduling may outweigh any quantization benefit, especially on embedded hardware. You may get better performance with frameworks or runtimes specifically optimized for Jetson, or by using CPU inference for small models. Would you like more detail on kernel support or possible workarounds?
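One workaround that may be worth trying, given the settings described in the question: enforce_eager=True disables CUDA graph capture in vLLM, and for a model this small the per-step kernel-launch overhead can dominate decode time. The sketch below is an untested suggestion, not a confirmed fix for Jetson. The helper build_engine_kwargs is hypothetical, the model id, gpu_memory_utilization, and max_model_len values are assumptions, and whether CUDA graphs work on a given source build for JetPack is not something the thread confirms.

```python
import time


def build_engine_kwargs(model, quantization=None):
    """Hypothetical helper: collect vllm.LLM() arguments for a small-model run."""
    kwargs = {
        "model": model,
        # Allow CUDA graph capture; the run in the question used enforce_eager=True.
        "enforce_eager": False,
        # Assumed value. vLLM preallocates this fraction of device memory, mostly
        # for KV cache, so high usage in jtop does not reflect model size.
        "gpu_memory_utilization": 0.3,
        # Assumed value; a shorter context shrinks the KV-cache reservation.
        "max_model_len": 2048,
    }
    if quantization is not None:
        kwargs["quantization"] = quantization
    return kwargs


if __name__ == "__main__":
    # Imported here so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs("Qwen/Qwen3-0.6B"))  # model id is an assumption
    params = SamplingParams(max_tokens=256, temperature=0.0)

    start = time.perf_counter()
    outputs = llm.generate(["Explain what a CUDA graph is."], params)
    elapsed = time.perf_counter() - start

    n_tokens = len(outputs[0].outputs[0].token_ids)
    print(f"{n_tokens / elapsed:.1f} tok/s (single request, includes prefill)")
```

Running the same script with enforce_eager set back to True gives a direct comparison on the same build. The ~25GB reported by jtop is likely mostly this preallocated cache rather than the 0.6B model itself.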