I have the following vLLM launch script on a 4x H100 setup to serve the Gemma 3 27B model:
#!/bin/bash
set -e
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
LOG_FILE="gemma3_27b_serve_$(date +%Y%m%d_%H%M%S).log"
CUDA_VERSION="12.8"
MODEL_PATH="./gemma-3-27b-it"
TP_SIZE=4
vllm serve "$MODEL_PATH" \
--dtype bfloat16 \
--tensor-parallel-size $TP_SIZE \
--gpu-memory-utilization 0.88 \
--max-model-len 36864 \
--max-num-seqs 256 \
--max-num-batched-tokens 131072 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enforce-eager \
--trust-remote-code \
2>&1 | tee "$LOG_FILE"
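Before each launch I also sanity-check that the GPUs are actually free, since a crashed previous run can leave memory pinned. This is just my own pre-flight snippet (the 3 GiB threshold is an arbitrary rule of thumb, not anything vLLM requires):

```shell
# Pre-flight check: confirm the GPUs are visible and mostly free.
# A stale worker from a crashed run can hold memory and break the next launch.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits |
  while IFS=', ' read -r idx used; do
    if [ "$used" -gt 3072 ]; then
      echo "GPU $idx already has ${used} MiB in use -- stale process?"
    fi
  done
else
  echo "nvidia-smi not found on this machine"
fi
```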
It constantly gets stuck at the point below, or hangs and then crashes out:
Loading safetensors checkpoint shards: 8% Completed | 1/12 [00:12<02:16, 12.40s/it]
Loading safetensors checkpoint shards: 17% Completed | 2/12 [00:27<02:21, 14.20s/it]
Loading safetensors checkpoint shards: 25% Completed | 3/12 [00:49<02:38, 17.61s/it]
Loading safetensors checkpoint shards: 33% Completed | 4/12 [01:10<02:30, 18.80s/it]
Loading safetensors checkpoint shards: 42% Completed | 5/12 [01:31<02:17, 19.68s/it]
Loading safetensors checkpoint shards: 50% Completed | 6/12 [01:33<01:21, 13.58s/it]
Loading safetensors checkpoint shards: 58% Completed | 7/12 [01:53<01:18, 15.64s/it]
Loading safetensors checkpoint shards: 67% Completed | 8/12 [02:14<01:10, 17.53s/it]
Loading safetensors checkpoint shards: 75% Completed | 9/12 [02:36<00:56, 18.78s/it]
Loading safetensors checkpoint shards: 83% Completed | 10/12 [02:57<00:39, 19.69s/it]
Loading safetensors checkpoint shards: 92% Completed | 11/12 [03:19<00:20, 20.15s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [03:40<00:00, 20.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [03:40<00:00, 18.40s/it]
(Worker_TP0 pid=3736093)
(Worker_TP0 pid=3736093) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 220.86 seconds
(Worker_TP3 pid=3736096) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 213.14 seconds
(Worker_TP1 pid=3736094) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 215.51 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:37 [default_loader.py:268] Loading weights took 233.88 seconds
(Worker_TP3 pid=3736096) INFO 09-18 23:58:44 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 215.464502 seconds
(Worker_TP0 pid=3736093) INFO 09-18 23:58:47 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 224.230714 seconds
(Worker_TP1 pid=3736094) INFO 09-18 23:58:47 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 219.069294 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:49 [gpu_model_runner.py:2392] Model loading took 13.2978 GiB and 239.016643 seconds
(Worker_TP2 pid=3736095) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP0 pid=3736093) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP1 pid=3736094) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
(Worker_TP3 pid=3736096) INFO 09-18 23:58:50 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 504 image items of the maximum feature size.
I'm using vLLM version 0.10.2, and the last thing I see in the logs is this:
(APIServer pid=3686257)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=3686257) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/projects/mlx/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/projects/mlx/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 5 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
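That "Failed core proc(s): {}" line seems generic, so I've been trying to pull the real traceback out of the tee'd log. The glob matches the LOG_FILE pattern from my script, and the keyword list is just my guess at what the root cause would contain:

```shell
# The generic "Engine core initialization failed" usually hides an earlier
# traceback from one of the worker processes. Pull the first error-ish lines
# out of the serve logs (glob matches LOG_FILE from the launch script).
grep -nEi 'error|traceback|out of memory|nccl' gemma3_27b_serve_*.log 2>/dev/null | head -n 20
```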
Can you help me understand where I'm going wrong?
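Happy to rerun with more verbose logging if that would help. This is what I was planning to try next; `VLLM_LOGGING_LEVEL` is vLLM's documented logging env var, `NCCL_DEBUG` and `CUDA_LAUNCH_BLOCKING` are the standard NCCL/CUDA ones, and the serve flags are the same as in my script above:

```shell
# Debug rerun: same serve command, but with verbose logging so the root cause
# shows up before the generic "Engine core initialization failed" message.
export VLLM_LOGGING_LEVEL=DEBUG   # vLLM's own log level
export NCCL_DEBUG=INFO            # surface NCCL init/collective details
export CUDA_LAUNCH_BLOCKING=1     # make async CUDA errors surface at the failing call

if command -v vllm >/dev/null 2>&1; then
  vllm serve ./gemma-3-27b-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 36864 \
    2>&1 | tee debug_serve.log
else
  echo "vllm not on PATH; env vars exported for the next launch"
fi
```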