vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts

Hardware Setup

  • Device: NVIDIA Jetson Thor 128GB

  • Inference Framework: vLLM (nightly-aarch64)

  • Model: sakamakismile/Qwen3.6-27B-NVFP4 (quantized)

  • Interface: Open WebUI

Problem Statement

When running inference with larger prompts (beyond a certain threshold), the model exhibits unexpected latency spikes:

  1. Initial Thinking Phase: 5–10 minutes before thinking/reasoning output appears

  2. Thinking Processing: Additional 5–10 minutes to complete the thinking process

  3. Final Response: Generation of the actual response follows

Simple/short prompts work correctly with normal response times, but complex or token-heavy prompts trigger this behavior consistently.

Current Configuration

docker run --rm \
  --name vllm-qwen36 \
  --device nvidia.com/gpu=all \
  --network host \
  --ipc=host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN=<hf_token> \
  -e HF_HOME=/root/.cache/huggingface \
  -v /root/ai-stack/hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:nightly-aarch64 \
  -c "pip install -q 'vllm[audio]' && vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.45 \
      --max-model-len 16384 \
      --max-num-seqs 8 \
      --enable-prefix-caching \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --trust-remote-code"


Key Configuration Parameters

Parameter Value Rationale
gpu-memory-utilization 0.45 Conservative to avoid OOM on quantized model
max-model-len 16384 Limited context window
max-num-seqs 8 Batch size constraint
enable-prefix-caching Enabled for efficiency
reasoning-parser qwen3 Qwen3 reasoning/thinking support

Observations & Questions

Suspected Root Causes

  1. Reasoning Token Expansion: Is the --reasoning-parser qwen3 flag causing internal token explosion during the thinking phase?

  2. Context Window Pressure: Could the 16384 max-model-len be insufficient, causing re-computation or re-tokenization?

  3. Quantization Overhead: Does NVFP4 quantization introduce computation overhead during reasoning phases?

  4. GPU Memory Thrashing: At 0.45 GPU utilization, is the model spilling to system RAM or host memory?

What I’ve Already Checked

  • ✓ No Docker resource exhaustion (memory/CPU stable during inference)

  • ✓ Network latency ruled out (local inference via localhost:8000)

  • ✓ Model loads successfully; simple prompts execute quickly

  • ✓ vLLM logs show no obvious errors or warnings

Questions for the Community

  1. Is there a known performance regression with Qwen3.6-27B on Jetson Thor when using the reasoning parser?

  2. Should I disable --reasoning-parser qwen3 and use a different approach to handle reasoning tokens?

  3. Are there optimal gpu-memory-utilization or max-model-len values for this model/hardware combo that reduce latency?

  4. Would increasing --max-num-seqs or adjusting --chunk-size help with thinking-phase throughput?

  5. Is TensorRT optimization available for Qwen3.6 on Jetson to bypass the vLLM reasoning bottleneck?

Expected Behavior

  • Simple prompt (~100 tokens) → immediate response (< 5 seconds)

  • Complex prompt (~2000 tokens with reasoning) → response within 1–2 minutes total

Actual Behavior

  • Simple prompt (~100 tokens) → immediate response ✓

  • Complex prompt (~2000 tokens with reasoning) → 10–20 minute total latency ✗


Any insights on optimization or alternative configurations appreciated!

There is no explicit documentation of a vLLM performance regression for Qwen3.6-27B on Jetson Thor with the reasoning parser, but similar issues have been reported for Qwen3.x models on other hardware: long prompts or high concurrency can cause severe latency, especially during the “thinking” phase, due to the hybrid architecture (DeltaNet/Mamba layers) and quantization overhead. The --reasoning-parser qwen3 flag can increase token processing during the reasoning phase, and NVFP4 quantization may further slow down prefill and thinking, particularly on ARM-based Jetson devices, which have lower memory bandwidth and less optimized kernel support compared to x86_64 platforms. Prefix caching can also introduce slowdowns or output issues for large prompts or batches, especially with Qwen3.x models and quantized weights, as seen in related reports. There is no mention of TensorRT optimization for Qwen3.6 in vLLM on Jetson Thor in the provided sources.

To mitigate latency: (1) Try disabling --reasoning-parser qwen3 to see if the bottleneck is in the reasoning extraction; (2) Lower --max-num-seqs to reduce batch pressure; (3) Increase --gpu-memory-utilization cautiously, monitoring for OOM, as too low a value can cause excessive swapping or recomputation; (4) Consider disabling --enable-prefix-caching, as it can degrade performance with large prompts on Qwen3.x; (5) There is no documented --chunk-size parameter in vLLM serve CLI, so chunking cannot be tuned directly; (6) For Jetson Thor, ensure you use the latest vLLM nightly and compatible CUDA/ROCm stack, but expect lower throughput than on server GPUs. For more detail or step-by-step tuning, let me know.

Sources:

Hint: Mention @RunLLM in the post for followups.