vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts

Chiragdl · May 20, 2026, 6:14pm

Hardware Setup

Device: NVIDIA Jetson Thor 128GB
Inference Framework: vLLM (nightly-aarch64)
Model: sakamakismile/Qwen3.6-27B-NVFP4 (quantized)
Interface: Open WebUI

Problem Statement

When running inference with larger prompts (beyond a certain threshold), the model exhibits unexpected latency spikes:

Initial Thinking Phase: 5–10 minutes before thinking/reasoning output appears
Thinking Processing: Additional 5–10 minutes to complete the thinking process
Final Response: Generation of the actual response follows

Simple/short prompts work correctly with normal response times, but complex or token-heavy prompts trigger this behavior consistently.

Current Configuration

docker run --rm \
  --name vllm-qwen36 \
  --device nvidia.com/gpu=all \
  --network host \
  --ipc=host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN=<hf_token> \
  -e HF_HOME=/root/.cache/huggingface \
  -v /root/ai-stack/hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:nightly-aarch64 \
  -c "pip install -q 'vllm[audio]' && vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.45 \
      --max-model-len 16384 \
      --max-num-seqs 8 \
      --enable-prefix-caching \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --trust-remote-code"

Key Configuration Parameters

Parameter	Value	Rationale
`gpu-memory-utilization`	0.45	Conservative to avoid OOM on quantized model
`max-model-len`	16384	Limited context window
`max-num-seqs`	8	Batch size constraint
`enable-prefix-caching`	✓	Enabled for efficiency
`reasoning-parser`	qwen3	Qwen3 reasoning/thinking support

Observations & Questions

Suspected Root Causes

Reasoning Token Expansion: Is the --reasoning-parser qwen3 flag causing internal token explosion during the thinking phase?
Context Window Pressure: Could the 16384 max-model-len be insufficient, causing re-computation or re-tokenization?
Quantization Overhead: Does NVFP4 quantization introduce computation overhead during reasoning phases?
GPU Memory Thrashing: At 0.45 GPU utilization, is the model spilling to system RAM or host memory?

What I’ve Already Checked

✓ No Docker resource exhaustion (memory/CPU stable during inference)
✓ Network latency ruled out (local inference via localhost:8000)
✓ Model loads successfully; simple prompts execute quickly
✓ vLLM logs show no obvious errors or warnings

Questions for the Community

Is there a known performance regression with Qwen3.6-27B on Jetson Thor when using the reasoning parser?
Should I disable --reasoning-parser qwen3 and use a different approach to handle reasoning tokens?
Are there optimal gpu-memory-utilization or max-model-len values for this model/hardware combo that reduce latency?
Would increasing --max-num-seqs or adjusting --chunk-size help with thinking-phase throughput?
Is TensorRT optimization available for Qwen3.6 on Jetson to bypass the vLLM reasoning bottleneck?

Expected Behavior

Simple prompt (~100 tokens) → immediate response (< 5 seconds)
Complex prompt (~2000 tokens with reasoning) → response within 1–2 minutes total

Actual Behavior

Simple prompt (~100 tokens) → immediate response ✓
Complex prompt (~2000 tokens with reasoning) → 10–20 minute total latency ✗

Any insights on optimization or alternative configurations appreciated!

RunLLM · May 20, 2026, 11:29pm

There is no explicit documentation of a vLLM performance regression for Qwen3.6-27B on Jetson Thor with the reasoning parser, but similar issues have been reported for Qwen3.x models on other hardware: long prompts or high concurrency can cause severe latency, especially during the “thinking” phase, due to the hybrid architecture (DeltaNet/Mamba layers) and quantization overhead. The --reasoning-parser qwen3 flag can increase token processing during the reasoning phase, and NVFP4 quantization may further slow down prefill and thinking, particularly on ARM-based Jetson devices, which have lower memory bandwidth and less optimized kernel support compared to x86_64 platforms. Prefix caching can also introduce slowdowns or output issues for large prompts or batches, especially with Qwen3.x models and quantized weights, as seen in related reports. There is no mention of TensorRT optimization for Qwen3.6 in vLLM on Jetson Thor in the provided sources.

To mitigate latency: (1) Try disabling --reasoning-parser qwen3 to see if the bottleneck is in the reasoning extraction; (2) Lower --max-num-seqs to reduce batch pressure; (3) Increase --gpu-memory-utilization cautiously, monitoring for OOM, as too low a value can cause excessive swapping or recomputation; (4) Consider disabling --enable-prefix-caching, as it can degrade performance with large prompts on Qwen3.x; (5) There is no documented --chunk-size parameter in vLLM serve CLI, so chunking cannot be tuned directly; (6) For Jetson Thor, ensure you use the latest vLLM nightly and compatible CUDA/ROCm stack, but expect lower throughput than on server GPUs. For more detail or step-by-step tuning, let me know.

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts General	1	88	May 20, 2026
Suggestion to improve inferencing speed Model Support	17	767	March 11, 2026
Jetson Orin + vLLM Qwen3-0.6B quantized models – GPU active but no speedup, need optimization tips General	1	103	April 20, 2026
Disabling reasoning of Qwen3-VL-8B-Thinking per request Model Support	1	3685	October 29, 2025
Help needed with vLLM Qwen3-32B inference performance tuning (long input slow) General	1	1373	May 26, 2025