My iGPU has 42 GB of VRAM allocated as GTT:

sudo dmesg | grep -i "GTT"
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.14.0-123037-tuxedo root=/dev/mapper/system-root ro quiet splash iommu=pt amdgpu.cwsr_enable=0 amdttm.pages_limit=11010048 amdttm.page_pool_size=11010048 amdgpu.gttsize=43008 loglevel=3 udev.log_level=3 kvm.enable_virt_at_load=0 vt.handoff=7
[ 0.052020] Kernel command line: BOOT_IMAGE=/vmlinuz-6.14.0-123037-tuxedo root=/dev/mapper/system-root ro quiet splash iommu=pt amdgpu.cwsr_enable=0 amdttm.pages_limit=11010048 amdttm.page_pool_size=11010048 amdgpu.gttsize=43008 loglevel=3 udev.log_level=3 kvm.enable_virt_at_load=0 vt.handoff=7
[ 18.799825] [drm] amdgpu: 43008M of GTT memory ready.
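As a sanity check, the kernel parameters agree with each other (assuming amdttm.pages_limit counts 4 KiB pages, which matches the numbers):

```python
# amdgpu.gttsize is given in MiB; amdttm.pages_limit in 4 KiB pages (assumption)
gtt_mib = 43008
pages_limit = 11010048

gtt_gib = gtt_mib / 1024                     # MiB -> GiB
pages_gib = pages_limit * 4 / (1024 * 1024)  # pages -> KiB -> GiB
print(gtt_gib, pages_gib)  # 42.0 42.0
```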
But vLLM 0.19.1 only sees 30.76 GiB:
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 30.76 GiB of which 1.54 MiB is free. Of the allocated memory 30.34 GiB is allocated by PyTorch, and 96.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.11 documentation)
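The numbers in the traceback line up as follows (plain arithmetic, no vLLM internals assumed): the process runs into the 30.76 GiB HIP-visible ceiling, not the 42 GiB GTT ceiling, and --gpu-memory-utilization 0.70 only budgets within whatever capacity HIP reports:

```python
visible_gib = 30.76  # capacity HIP/PyTorch reports
gtt_gib = 42.0       # capacity dmesg reports for GTT
util = 0.70          # --gpu-memory-utilization

budget_gib = round(util * visible_gib, 2)     # what vLLM plans around
hidden_gib = round(gtt_gib - visible_gib, 2)  # GTT that HIP never exposes
print(budget_gib, hidden_gib)  # 21.53 11.24
```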
docker run -it \
  --dns=192.168.49.1 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  --shm-size=4g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0 \
  -e HIP_VISIBLE_DEVICES=0 \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.0 \
  -e PYTORCH_ROCM_ARCH="gfx1150" \
  -e ROC_ALLOW_MAX_QUERY_SIZE=1 \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e GPU_MAX_HW_QUEUES=1 \
  --restart=unless-stopped \
  --name vllm_google_gemma-4-26B-A4B-it-1req \
  --entrypoint /bin/bash \
  -v /home/antti/vllm_models:/root/.cache/huggingface \
  -v /home/antti/documentation:/app/docker_shared \
  vllm/vllm-openai-rocm:latest \
  -c "sudo apt update && sudo apt-get install -y nano && pip install --upgrade 'transformers>=5.5.4' pandas tqdm scikit-learn && vllm serve google/gemma-4-26B-A4B-it \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name gemma3 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --attention-backend TRITON_ATTN \
    --dtype bfloat16 \
    --kv-cache-dtype auto \
    --max-model-len 24000 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 2000 \
    --gpu-memory-utilization 0.70 \
    --enable-chunked-prefill \
    --max-log-len 100"
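To narrow down where the discrepancy comes from, it may help to compare what the kernel exposes with what HIP reports. A minimal sketch reading amdgpu's sysfs memory counters (the card index is an assumption; it may differ on your system):

```python
from pathlib import Path

def amdgpu_mem_gib(card: str = "card0", counter: str = "mem_info_gtt_total"):
    """Read an amdgpu sysfs memory counter (in bytes) and convert to GiB.

    Returns None when the entry does not exist (wrong card index,
    non-amdgpu device, or sysfs not visible from this environment).
    """
    p = Path(f"/sys/class/drm/{card}/device/{counter}")
    if not p.exists():
        return None
    return int(p.read_text()) / 2**30

# Compare GTT vs dedicated VRAM as the kernel sees them:
for counter in ("mem_info_gtt_total", "mem_info_vram_total", "mem_info_vis_vram_total"):
    print(counter, amdgpu_mem_gib(counter=counter))
```

If mem_info_gtt_total reports ~42 GiB on the host while torch.cuda.get_device_properties(0).total_memory reports ~30.76 GiB inside the container, the cap is being imposed by the ROCm/HIP runtime rather than by the kernel's GTT configuration.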