Getting flashinfer.jit: [Autotuner]: OOM detected

Hi, I am getting this:
flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic

With 2x 5090 and vllm/vllm-openai:gemma4-cu130

-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
-e NCCL_P2P_DISABLE=1
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
--restart unless-stopped
--name vllm_nvidia_Gemma-4-31B-IT-NVFP4
gemma4-vllm-container3
--model nvidia/Gemma-4-31B-IT-NVFP4
--served-model-name gemma3
--dtype auto
--max-model-len 7096
--max-num-seqs 1
--gpu-memory-utilization 0.93
--enable-prefix-caching
--enable-chunked-prefill
--tensor-parallel-size 2
--quantization nvfp4
--kv-cache-dtype fp8
--quantization modelopt
--max-num-batched-tokens 20000
--max-log-len 100

The message "flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic" means FlashInfer's kernel autotuner ran out of GPU memory while benchmarking candidate kernels, so it reverted to a less memory-intensive (but possibly slower) default kernel. This is not a fatal error: serving will continue, but you may see lower throughput or higher latency. To leave more memory free for autotuning, reduce --max-num-batched-tokens, lower --gpu-memory-utilization, or use a smaller --max-model-len, as recommended in the Gemma 4 usage guide and the memory troubleshooting docs.
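As a rough illustration of that advice, here is a minimal sketch of the engine arguments from the original post with two of the knobs turned down. The specific values (8192 and 0.90) are assumptions chosen only for illustration, not tested recommendations, and the variable names are hypothetical helpers rather than anything vLLM reads.

```shell
# Illustrative values only (assumptions); tune them for your own GPUs.
MAX_NUM_BATCHED_TOKENS=8192   # original post used 20000
GPU_MEMORY_UTILIZATION=0.90   # original post used 0.93
MAX_MODEL_LEN=7096            # unchanged from the original post

# Engine arguments to pass after the image name in the docker run command.
VLLM_ARGS="--model nvidia/Gemma-4-31B-IT-NVFP4 \
--tensor-parallel-size 2 \
--max-model-len ${MAX_MODEL_LEN} \
--max-num-seqs 1 \
--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
--max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS} \
--kv-cache-dtype fp8 \
--quantization modelopt"

# Print the arguments so they can be reviewed before use.
echo "$VLLM_ARGS"
```

If the warning still appears after a restart, lowering --max-num-batched-tokens further is probably the first thing to try, since that is the change a later reply in this thread reports as working.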

Would you like more detail on how to tune these parameters for your hardware?

BUMP
Encountered this issue on 0.20; it was not present on 0.19.
The model was able to load successfully in the end.

vllm serve ./models/sakamakismile-Qwen3.6-27B-Text-NVFP4-MTP \
  --tensor-parallel-size 2 \
  --max-model-len 64K \
  --max-num-batched-tokens 8K \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --quantization modelopt \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --language-model-only

(Worker_TP0 pid=1442) 2026-05-01 12:34:20,455 - WARNING - autotuner.py:861 - flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic
(Worker_TP1 pid=1443) 2026-05-01 12:34:20,460 - WARNING - autotuner.py:861 - flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic

EDIT: I see there is an open PR to fix this issue: "[Bugfix] Run FlashInfer autotuning before KV cache allocation" by bhoomit (Pull Request #40887, vllm-project/vllm).

Got it working by lowering the --max-num-batched-tokens value.