Getting flashinfer.jit: [Autotuner]: OOM detected

Hi, I am getting this:
flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic

This is on 2x RTX 5090 with the vllm/vllm-openai:gemma4-cu130 image, launched with:

-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
-e NCCL_P2P_DISABLE=1
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
--restart unless-stopped
--name vllm_nvidia_Gemma-4-31B-IT-NVFP4
gemma4-vllm-container3
--model nvidia/Gemma-4-31B-IT-NVFP4
--served-model-name gemma3
--dtype auto
--max-model-len 7096
--max-num-seqs 1
--gpu-memory-utilization 0.93
--enable-prefix-caching
--enable-chunked-prefill
--tensor-parallel-size 2
--quantization nvfp4
--kv-cache-dtype fp8
--quantization modelopt
--max-num-batched-tokens 20000
--max-log-len 100

The message `flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic` means FlashInfer's kernel autotuner ran out of GPU memory while benchmarking candidate kernels, so it reverted to a less memory-intensive (but possibly slower) default kernel. This is not a fatal error: serving will continue, but you may see lower throughput or higher latency.

To avoid it, reduce `--max-num-batched-tokens`, lower `--gpu-memory-utilization`, or use a smaller `--max-model-len` to leave more free memory for autotuning, as recommended in the Gemma 4 usage guide and memory troubleshooting docs.
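As a concrete starting point, a more conservative launch might look like the sketch below. The specific numbers are illustrative assumptions on my part, not documented recommendations, so tune them for your workload:

```
# Sketch only; values are illustrative, not taken from the vLLM docs.
# With --max-num-seqs 1, a batched-token budget above --max-model-len is
# never filled, so capping it at the context length frees memory that the
# FlashInfer autotuner can use. Lowering --gpu-memory-utilization adds
# further headroom.
docker run ... gemma4-vllm-container3 \
    --model nvidia/Gemma-4-31B-IT-NVFP4 \
    --served-model-name gemma3 \
    --max-model-len 7096 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 7096 \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 \
    --quantization modelopt
```

Note that your original command passes `--quantization` twice (nvfp4 and modelopt); with argparse-style parsing the last value typically wins, so the sketch keeps a single flag. If the warning persists, step `--gpu-memory-utilization` down further (e.g. 0.85) before shrinking `--max-model-len`, since the context length directly caps your prompts.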

Would you like more detail on how to tune these parameters for your hardware?
