vLLM Qwen3.6-27B Extended Latency on Jetson Thor 128GB with Large Prompts

Hardware Setup

  • Device: NVIDIA Jetson Thor 128GB

  • Inference Framework: vLLM (nightly-aarch64)

  • Model: sakamakismile/Qwen3.6-27B-NVFP4 (quantized)

  • Interface: Open WebUI

Problem Statement

When running inference with larger prompts (beyond a certain threshold), the model exhibits unexpected latency spikes:

  1. Initial Thinking Phase: 5–10 minutes before thinking/reasoning output appears

  2. Thinking Processing: Additional 5–10 minutes to complete the thinking process

  3. Final Response: Generation of the actual response follows

Simple/short prompts work correctly with normal response times, but complex or token-heavy prompts trigger this behavior consistently.

Current Configuration

docker run --rm \
  --name vllm-qwen36 \
  --device nvidia.com/gpu=all \
  --network host \
  --ipc=host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN=<hf_token> \
  -e HF_HOME=/root/.cache/huggingface \
  -v /root/ai-stack/hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:nightly-aarch64 \
  -c "pip install -q 'vllm[audio]' && vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.45 \
      --max-model-len 16384 \
      --max-num-seqs 8 \
      --enable-prefix-caching \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --trust-remote-code"


Key Configuration Parameters

Parameter Value Rationale
gpu-memory-utilization 0.45 Conservative to avoid OOM on quantized model
max-model-len 16384 Limited context window
max-num-seqs 8 Batch size constraint
enable-prefix-caching Enabled for efficiency
reasoning-parser qwen3 Qwen3 reasoning/thinking support

Observations & Questions

Suspected Root Causes

  1. Reasoning Token Expansion: Is the --reasoning-parser qwen3 flag causing internal token explosion during the thinking phase?

  2. Context Window Pressure: Could the 16384 max-model-len be insufficient, causing re-computation or re-tokenization?

  3. Quantization Overhead: Does NVFP4 quantization introduce computation overhead during reasoning phases?

  4. GPU Memory Thrashing: At 0.45 GPU utilization, is the model spilling to system RAM or host memory?

What I’ve Already Checked

  • ✓ No Docker resource exhaustion (memory/CPU stable during inference)

  • ✓ Network latency ruled out (local inference via localhost:8000)

  • ✓ Model loads successfully; simple prompts execute quickly

  • ✓ vLLM logs show no obvious errors or warnings

Questions for the Community

  1. Is there a known performance regression with Qwen3.6-27B on Jetson Thor when using the reasoning parser?

  2. Should I disable --reasoning-parser qwen3 and use a different approach to handle reasoning tokens?

  3. Are there optimal gpu-memory-utilization or max-model-len values for this model/hardware combo that reduce latency?

  4. Would increasing --max-num-seqs or adjusting --chunk-size help with thinking-phase throughput?

  5. Is TensorRT optimization available for Qwen3.6 on Jetson to bypass the vLLM reasoning bottleneck?

Expected Behavior

  • Simple prompt (~100 tokens) → immediate response (< 5 seconds)

  • Complex prompt (~2000 tokens with reasoning) → response within 1–2 minutes total

Actual Behavior

  • Simple prompt (~100 tokens) → immediate response ✓

  • Complex prompt (~2000 tokens with reasoning) → 10–20 minute total latency ✗


Any insights on optimization or alternative configurations appreciated!

Your symptoms—very slow “thinking” (reasoning) phase for large prompts on Qwen3.6-27B-NVFP4 with vLLM on Jetson Thor—are a known issue with Qwen3.5/3.6 models, especially when using the reasoning parser and prefix caching. The main causes are: (1) the hybrid Gated DeltaNet architecture in Qwen3.5/3.6, which makes prefill (initial prompt processing) inherently slower and less parallelizable, and (2) prefix caching and reasoning parsing, which can further fragment prefill and increase latency for long prompts. This is not specific to Jetson Thor; similar latency spikes are reported on A100, L20, and other GPUs for these models with long prompts and reasoning enabled. NVFP4 quantization does not significantly worsen this, but low --gpu-memory-utilization can exacerbate the issue by reducing available KV cache and increasing recomputation pressure. See Qwen3.5/3.6 Usage Guide, performance regression discussion, and batch inference slowness.

Disabling --reasoning-parser qwen3 (or setting --default-chat-template-kwargs ‘{“enable_thinking”: false}’) will eliminate the “thinking” phase and restore normal prompt-to-response latency, but you lose explicit reasoning output. Increasing --gpu-memory-utilization (if memory allows) and reducing --max-num-seqs (to minimize batch fragmentation) can help, but will not fully resolve the architectural bottleneck. There is no TensorRT path for Qwen3.6 reasoning in vLLM as of now. For best performance, use “nothink” mode (no reasoning parser), avoid prefix caching for large batches, and keep prompt lengths moderate. If you need reasoning output, expect high latency for long prompts—this is a model and architecture limitation, not a vLLM bug.

Would you like a step-by-step breakdown of configuration changes and their effects?

Sources:

Hint: Mention @RunLLM in the post for followups.