Issue serving gemma3-27b-it

I have the following vLLM script on a 4xH100 setup to run the gemma-3-27b-it model:

#!/bin/bash

set -e

export VLLM_ATTENTION_BACKEND=FLASH_ATTN
LOG_FILE="gemma3_27b_serve_$(date +%Y%m%d_%H%M%S).log"
CUDA_VERSION="12.8"
MODEL_PATH="./gemma-3-27b-it"
TP_SIZE=4

vllm serve "$MODEL_PATH" \
  --dtype bfloat16 \
  --tensor-parallel-size $TP_SIZE \
  --gpu-memory-utilization 0.88 \
  --max-model-len 36864 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enforce-eager \
  --trust-remote-code \
  2>&1 | tee "$LOG_FILE"

It constantly gets stuck here, or it hangs and crashes out:

Loading safetensors checkpoint shards:   8% Completed | 1/12 [00:12<02:16, 12.40s/it]
Loading safetensors checkpoint shards:  17% Completed | 2/12 [00:27<02:21, 14.20s/it]
Loading safetensors checkpoint shards:  25% Completed | 3/12 [00:49<02:38, 17.61s/it]
Loading safetensors checkpoint shards:  33% Completed | 4/12 [01:10<02:30, 18.80s/it]
Loading safetensors checkpoint shards:  42% Completed | 5/12 [01:31<02:17, 19.68s/it]
Loading safetensors checkpoint shards:  50% Completed | 6/12 [01:33<01:21, 13.58s/it]
Loading safetensors checkpoint shards:  58% Completed | 7/12 [01:53<01:18, 15.64s/it]
Loading safetensors checkpoint shards:  67% Completed | 8/12 [02:14<01:10, 17.53s/it]
Loading safetensors checkpoint shards:  75% Completed | 9/12 [02:36<00:56, 18.78s/it]
Loading safetensors checkpoint shards:  83% Completed | 10/12 [02:57<00:39, 19.69s/it]
Loading safetensors checkpoint shards:  92% Completed | 11/12 [03:19<00:20, 20.15s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [03:40<00:00, 20.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [03:40<00:00, 18.40s/it]
(Worker_TP0 pid=3736093) 
(Worker_TP0 pid=3736093) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 220.86 seconds
(Worker_TP3 pid=3736096) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 213.14 seconds
(Worker_TP1 pid=3736094) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 215.51 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 233.88 seconds
(Worker_TP3 pid=3736096) INFO 09-18 23:58:44 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 215.464502 seconds
(Worker_TP0 pid=3736093) INFO 09-18 23:58:47 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 224.230714 seconds
(Worker_TP1 pid=3736094) INFO 09-18 23:58:47 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 219.069294 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:49 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 239.016643 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP0 pid=3736093) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP1 pid=3736094) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP3 pid=3736096) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.

I'm using vLLM version 0.10.2, and the last messages I see in the logs are these:

(APIServer pid=3686257)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=3686257) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/projects/mlx/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/projects/mlx/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 5 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Can you help me understand where I'm going wrong?

Your logs don't show a specific root-cause traceback, but an "Engine core initialization failed" error right after weight loading and encoder-cache initialization typically points to out-of-memory (OOM) or resource exhaustion during the memory-profiling step, especially with a large multimodal model and a long context length. Note that the encoder cache budget in your log (131072 tokens, profiled with 504 maximum-size image items) matches your --max-num-batched-tokens value, so that profiling pass is itself memory-hungry. Similar reports for large models on multi-GPU setups describe vLLM hanging or crashing right after weight loading when there isn't enough GPU memory left for the requested max-model-len, max-num-seqs, and KV cache. The resource-tracker warnings about leaked semaphores and shared-memory objects are a side effect of the abrupt worker termination, not the root cause. See the troubleshooting docs and related issues.
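
If you want to confirm the OOM hypothesis before changing anything, a minimal sketch (assuming nvidia-smi is available on the host and the log file produced by your script) is to watch per-GPU memory while the server starts and grep the log after the crash; memory hitting the cards' limit right after the "Encoder cache will be initialized" message strongly suggests the profiling step is what dies:

# In a second terminal: per-GPU memory usage once a second while vllm serve starts.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# After the crash: look for the real root cause above the "Engine core initialization failed" line.
grep -inE "out of memory|CUDA error|Traceback" gemma3_27b_serve_*.log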

To resolve this, try lowering --max-model-len, --max-num-seqs, --max-num-batched-tokens, or --gpu-memory-utilization. Keep --enforce-eager while you debug: it already disables CUDA graph capture, so removing it would add memory overhead rather than reduce it. If the engine still fails to initialize, reduce the batch size or context length further. For more detail, set export VLLM_LOGGING_LEVEL=DEBUG and check the logs for OOM or CUDA errors. See the conserving-memory guide for more options.
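
As a concrete starting point, a more conservative launch could look like the sketch below. It reuses the model path, tensor-parallel size, and log-file pattern from your script; the specific numbers are assumptions chosen to illustrate the direction, not tuned values, so adjust them to your workload:

#!/bin/bash
set -e

# Sketch only: the values below are illustrative assumptions, not tuned settings.
# - A shorter --max-model-len and smaller --max-num-seqs shrink the KV-cache reservation.
# - A smaller --max-num-batched-tokens also shrinks the encoder-cache budget shown in your log.
# - --enforce-eager stays, so no extra memory is spent on CUDA graph capture.
export VLLM_LOGGING_LEVEL=DEBUG   # surface OOM / CUDA errors during startup

MODEL_PATH="./gemma-3-27b-it"
TP_SIZE=4
LOG_FILE="gemma3_27b_serve_$(date +%Y%m%d_%H%M%S).log"

vllm serve "$MODEL_PATH" \
  --dtype bfloat16 \
  --tensor-parallel-size $TP_SIZE \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enforce-eager \
  --trust-remote-code \
  2>&1 | tee "$LOG_FILE"

If this comes up cleanly, raise the limits back one at a time until you find the largest combination your 4xH100 setup can hold.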

Would you like a step-by-step breakdown or example of how to adjust these parameters?
