vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts

Chirag · May 20, 2026, 6:14pm

Hardware Setup

Device: NVIDIA Jetson Thor 128GB
Inference Framework: vLLM (nightly-aarch64)
Model: sakamakismile/Qwen3.6-27B-NVFP4 (quantized)
Interface: Open WebUI

Problem Statement

When running inference with larger prompts (beyond a certain threshold), the model exhibits unexpected latency spikes:

Initial Thinking Phase: 5–10 minutes before thinking/reasoning output appears
Thinking Processing: Additional 5–10 minutes to complete the thinking process
Final Response: Generation of the actual response follows

Simple/short prompts work correctly with normal response times, but complex or token-heavy prompts trigger this behavior consistently.

Current Configuration

docker run --rm \
  --name vllm-qwen36 \
  --device nvidia.com/gpu=all \
  --network host \
  --ipc=host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN=<hf_token> \
  -e HF_HOME=/root/.cache/huggingface \
  -v /root/ai-stack/hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:nightly-aarch64 \
  -c "pip install -q 'vllm[audio]' && vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.45 \
      --max-model-len 16384 \
      --max-num-seqs 8 \
      --enable-prefix-caching \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --trust-remote-code"

Key Configuration Parameters

Parameter	Value	Rationale
`gpu-memory-utilization`	0.45	Conservative to avoid OOM on quantized model
`max-model-len`	16384	Limited context window
`max-num-seqs`	8	Batch size constraint
`enable-prefix-caching`	✓	Enabled for efficiency
`reasoning-parser`	qwen3	Qwen3 reasoning/thinking support

Observations & Questions

Suspected Root Causes

Reasoning Token Expansion: Is the --reasoning-parser qwen3 flag causing internal token explosion during the thinking phase?
Context Window Pressure: Could the 16384 max-model-len be insufficient, causing re-computation or re-tokenization?
Quantization Overhead: Does NVFP4 quantization introduce computation overhead during reasoning phases?
GPU Memory Thrashing: At 0.45 GPU utilization, is the model spilling to system RAM or host memory?

What I’ve Already Checked

✓ No Docker resource exhaustion (memory/CPU stable during inference)
✓ Network latency ruled out (local inference via localhost:8000)
✓ Model loads successfully; simple prompts execute quickly
✓ vLLM logs show no obvious errors or warnings

Questions for the Community

Is there a known performance regression with Qwen3.6-27B on Jetson Thor when using the reasoning parser?
Should I disable --reasoning-parser qwen3 and use a different approach to handle reasoning tokens?
Are there optimal gpu-memory-utilization or max-model-len values for this model/hardware combo that reduce latency?
Would increasing --max-num-seqs or adjusting --chunk-size help with thinking-phase throughput?
Is TensorRT optimization available for Qwen3.6 on Jetson to bypass the vLLM reasoning bottleneck?

Expected Behavior

Simple prompt (~100 tokens) → immediate response (< 5 seconds)
Complex prompt (~2000 tokens with reasoning) → response within 1–2 minutes total

Actual Behavior

Simple prompt (~100 tokens) → immediate response ✓
Complex prompt (~2000 tokens with reasoning) → 10–20 minute total latency ✗

Any insights on optimization or alternative configurations appreciated!

RunLLM · May 20, 2026, 11:29pm

Your symptoms—very slow “thinking” (reasoning) phase for large prompts on Qwen3.6-27B-NVFP4 with vLLM on Jetson Thor—are a known issue with Qwen3.5/3.6 models, especially when using the reasoning parser and prefix caching. The main causes are: (1) the hybrid Gated DeltaNet architecture in Qwen3.5/3.6, which makes prefill (initial prompt processing) inherently slower and less parallelizable, and (2) prefix caching and reasoning parsing, which can further fragment prefill and increase latency for long prompts. This is not specific to Jetson Thor; similar latency spikes are reported on A100, L20, and other GPUs for these models with long prompts and reasoning enabled. NVFP4 quantization does not significantly worsen this, but low --gpu-memory-utilization can exacerbate the issue by reducing available KV cache and increasing recomputation pressure. See Qwen3.5/3.6 Usage Guide, performance regression discussion, and batch inference slowness.

Disabling --reasoning-parser qwen3 (or setting --default-chat-template-kwargs ‘{“enable_thinking”: false}’) will eliminate the “thinking” phase and restore normal prompt-to-response latency, but you lose explicit reasoning output. Increasing --gpu-memory-utilization (if memory allows) and reducing --max-num-seqs (to minimize batch fragmentation) can help, but will not fully resolve the architectural bottleneck. There is no TensorRT path for Qwen3.6 reasoning in vLLM as of now. For best performance, use “nothink” mode (no reasoning parser), avoid prefix caching for large batches, and keep prompt lengths moderate. If you need reasoning output, expect high latency for long prompts—this is a model and architecture limitation, not a vLLM bug.

Would you like a step-by-step breakdown of configuration changes and their effects?

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts General	1	71	May 20, 2026
Suggestion to improve inferencing speed Model Support	17	767	March 11, 2026
Disabling reasoning of Qwen3-VL-8B-Thinking per request Model Support	1	3685	October 29, 2025
Help needed with vLLM Qwen3-32B inference performance tuning (long input slow) General	1	1373	May 26, 2025
Jetson Orin + vLLM Qwen3-0.6B quantized models – GPU active but no speedup, need optimization tips General	1	103	April 20, 2026