Getting flashinfer.jit: [Autotuner]: OOM detected

AnttiK · April 18, 2026, 4:37pm

Hi, I am getting this:
flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic

With 2x 5090 and vllm/vllm-openai:gemma4-cu130

-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
-e NCCL_P2P_DISABLE=1
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
--restart unless-stopped
--name vllm_nvidia_Gemma-4-31B-IT-NVFP4
gemma4-vllm-container3
--model nvidia/Gemma-4-31B-IT-NVFP4
--served-model-name gemma3
--dtype auto
--max-model-len 7096
--max-num-seqs 1
--gpu-memory-utilization 0.93
--enable-prefix-caching
--enable-chunked-prefill
--tensor-parallel-size 2
--quantization nvfp4
--kv-cache-dtype fp8
--quantization modelopt
--max-num-batched-tokens 20000
--max-log-len 100

RunLLM · April 18, 2026, 4:37pm

The message flashinfer.jit: [Autotuner]: OOM detected, falling back to default tactic means FlashInfer’s kernel autotuner ran out of GPU memory while benchmarking candidate kernels, so it fell back to the default kernel, which needs less memory but may be slower. This is not a fatal error: serving will continue, but you may see lower throughput or higher latency. To avoid it, reduce --max-num-batched-tokens, lower --gpu-memory-utilization, or use a smaller --max-model-len to free up memory for autotuning, as recommended in the Gemma 4 usage guide and memory troubleshooting docs.
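As a rough illustration of why lowering --gpu-memory-utilization can help, here is a minimal Python sketch of the per-GPU memory left outside vLLM's budget. It assumes a nominal 32 GiB per RTX 5090 (usable capacity is somewhat lower in practice) and assumes the autotuner draws on memory that vLLM has not already claimed; neither assumption comes from the logs above. The helper name headroom_gib is made up for this example.

```python
def headroom_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Return the memory per GPU (in GiB) that lies outside vLLM's budget.

    vLLM budgets total_gib * gpu_memory_utilization for itself; whatever
    remains is the slack available to anything else on the device.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * (1.0 - gpu_memory_utilization)


# Assumption: nominal capacity of one RTX 5090; the driver reports a bit less.
TOTAL_GIB = 32.0

# Compare the posted setting (0.93) with two lower candidates.
for util in (0.93, 0.90, 0.85):
    slack = headroom_gib(TOTAL_GIB, util)
    print(f"--gpu-memory-utilization {util:.2f}: about {slack:.2f} GiB outside vLLM's budget")
```

Under these assumptions, 0.93 leaves a little over 2 GiB per GPU, while 0.90 leaves a bit more than 3 GiB. In the posted command, --max-num-batched-tokens 20000 is also well above --max-model-len 7096 with --max-num-seqs 1, so it may be a natural first value to lower.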

Would you like more detail on how to tune these parameters for your hardware?

