OutOfMemoryError: vLLM can't see the max memory available

My iGPU has 42 GB of VRAM allocated:

sudo dmesg | grep -i "GTT"
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.14.0-123037-tuxedo root=/dev/mapper/system-root ro quiet splash iommu=pt amdgpu.cwsr_enable=0 amdttm.pages_limit=11010048 amdttm.page_pool_size=11010048 amdgpu.gttsize=43008 loglevel=3 udev.log_level=3 kvm.enable_virt_at_load=0 vt.handoff=7
[ 0.052020] Kernel command line: BOOT_IMAGE=/vmlinuz-6.14.0-123037-tuxedo root=/dev/mapper/system-root ro quiet splash iommu=pt amdgpu.cwsr_enable=0 amdttm.pages_limit=11010048 amdttm.page_pool_size=11010048 amdgpu.gttsize=43008 loglevel=3 udev.log_level=3 kvm.enable_virt_at_load=0 vt.handoff=7
[ 18.799825] [drm] amdgpu: 43008M of GTT memory ready.

But vLLM 0.19.1 sees:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 30.76 GiB of which 1.54 MiB is free. Of the allocated memory 30.34 GiB is allocated by PyTorch, and 96.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.11 documentation)

docker run -it \
  --dns=192.168.49.1 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  --shm-size=4g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0 \
  -e HIP_VISIBLE_DEVICES=0 \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.0 \
  -e PYTORCH_ROCM_ARCH="gfx1150" \
  -e ROC_ALLOW_MAX_QUERY_SIZE=1 \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e GPU_MAX_HW_QUEUES=1 \
  --restart=unless-stopped \
  --name vllm_google_gemma-4-26B-A4B-it-1req \
  --entrypoint /bin/bash \
  -v /home/antti/vllm_models:/root/.cache/huggingface \
  -v /home/antti/documentation:/app/docker_shared \
  vllm/vllm-openai-rocm:latest \
  -c "sudo apt update && sudo apt-get install -y nano && pip install --upgrade 'transformers>=5.5.4' pandas tqdm scikit-learn && vllm serve google/gemma-4-26B-A4B-it \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name gemma3 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --attention-backend TRITON_ATTN \
    --dtype bfloat16 \
    --kv-cache-dtype auto \
    --max-model-len 24000 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 2000 \
    --gpu-memory-utilization 0.70 \
    --enable_chunked_prefill \
    --max-log-len 100"

vLLM and PyTorch only see the VRAM that the ROCm runtime exposes, not the full GTT (Graphics Translation Table) memory reported by the kernel/amdgpu driver. GTT is system memory mapped for GPU use, but vLLM sizes itself from the capacity HIP/ROCm reports, which is typically the dedicated VRAM rather than the full GTT pool. That is why vLLM sees only ~30 GiB even though your kernel parameters set aside 42 GiB of GTT. This is a known limitation and is not directly configurable from vLLM or PyTorch; it's determined by the ROCm driver and hardware support for unified memory on your iGPU platform. See the vLLM troubleshooting guide and related ROCm issues for more details.
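
You can confirm what HIP itself reports on your system with a quick check. This is only a sketch: run the Python one-liner inside the container (torch.cuda maps to HIP on ROCm builds), and rocm-smi on the host if it is installed; flag support can vary by ROCm version.

# Inside the container: what PyTorch/HIP reports as usable device memory
python3 -c "import torch; p = torch.cuda.get_device_properties(0); free, total = torch.cuda.mem_get_info(0); print(f'HIP-visible capacity: {p.total_memory/2**30:.2f} GiB, free: {free/2**30:.2f} GiB')"

# On the host: the driver's view of VRAM vs. GTT (flag support varies by ROCm version)
rocm-smi --showmeminfo vram gtt

The first number should line up with the ~30.76 GiB in the error message rather than with the 43008M GTT line from dmesg.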

If you want to maximize usable memory, ensure no other processes are using GPU memory, and try reducing --max-model-len, --max-num-seqs, or --gpu-memory-utilization in your vLLM command. However, vLLM cannot use more memory than what ROCm/HIP reports as available, regardless of GTT settings. Would you like a step-by-step breakdown of how vLLM and ROCm manage memory on AMD GPUs?
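
As a rough sketch (illustrative values only, not tuned recommendations; the rest of the docker command stays the same), the serve invocation could be tightened like this:

# Same serve command as above, with a shorter context and smaller prefill batch (example values)
vllm serve google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name gemma3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --attention-backend TRITON_ATTN \
  --dtype bfloat16 \
  --kv-cache-dtype auto \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 1024 \
  --gpu-memory-utilization 0.70 \
  --enable_chunked_prefill \
  --max-log-len 100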
