Hardware Setup
-
Device: NVIDIA Jetson Thor 128GB
-
Inference Framework: vLLM (nightly-aarch64)
-
Model:
sakamakismile/Qwen3.6-27B-NVFP4(quantized) -
Interface: Open WebUI
Problem Statement
When running inference with larger prompts (beyond a certain threshold), the model exhibits unexpected latency spikes:
-
Initial Thinking Phase: 5–10 minutes before thinking/reasoning output appears
-
Thinking Processing: Additional 5–10 minutes to complete the thinking process
-
Final Response: Generation of the actual response follows
Simple/short prompts work correctly with normal response times, but complex or token-heavy prompts trigger this behavior consistently.
Current Configuration
docker run --rm \
--name vllm-qwen36 \
--device nvidia.com/gpu=all \
--network host \
--ipc=host \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN=<hf_token> \
-e HF_HOME=/root/.cache/huggingface \
-v /root/ai-stack/hf-cache:/root/.cache/huggingface \
--entrypoint bash \
vllm/vllm-openai:nightly-aarch64 \
-c "pip install -q 'vllm[audio]' && vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.45 \
--max-model-len 16384 \
--max-num-seqs 8 \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code"
Key Configuration Parameters
| Parameter | Value | Rationale |
|---|---|---|
gpu-memory-utilization |
0.45 | Conservative to avoid OOM on quantized model |
max-model-len |
16384 | Limited context window |
max-num-seqs |
8 | Batch size constraint |
enable-prefix-caching |
✓ | Enabled for efficiency |
reasoning-parser |
qwen3 | Qwen3 reasoning/thinking support |
Observations & Questions
Suspected Root Causes
-
Reasoning Token Expansion: Is the
--reasoning-parser qwen3flag causing internal token explosion during the thinking phase? -
Context Window Pressure: Could the 16384
max-model-lenbe insufficient, causing re-computation or re-tokenization? -
Quantization Overhead: Does NVFP4 quantization introduce computation overhead during reasoning phases?
-
GPU Memory Thrashing: At 0.45 GPU utilization, is the model spilling to system RAM or host memory?
What I’ve Already Checked
-
✓ No Docker resource exhaustion (memory/CPU stable during inference)
-
✓ Network latency ruled out (local inference via localhost:8000)
-
✓ Model loads successfully; simple prompts execute quickly
-
✓ vLLM logs show no obvious errors or warnings
Questions for the Community
-
Is there a known performance regression with Qwen3.6-27B on Jetson Thor when using the reasoning parser?
-
Should I disable
--reasoning-parser qwen3and use a different approach to handle reasoning tokens? -
Are there optimal
gpu-memory-utilizationormax-model-lenvalues for this model/hardware combo that reduce latency? -
Would increasing
--max-num-seqsor adjusting--chunk-sizehelp with thinking-phase throughput? -
Is TensorRT optimization available for Qwen3.6 on Jetson to bypass the vLLM reasoning bottleneck?
Expected Behavior
-
Simple prompt (~100 tokens) → immediate response (< 5 seconds)
-
Complex prompt (~2000 tokens with reasoning) → response within 1–2 minutes total
Actual Behavior
-
Simple prompt (~100 tokens) → immediate response ✓
-
Complex prompt (~2000 tokens with reasoning) → 10–20 minute total latency ✗
Any insights on optimization or alternative configurations appreciated!