How to check the error stack trace

vllm serve /home/ubuntu/models/RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.85 --dtype auto

Error
(VllmWorker rank=2 pid=10434) Process SpawnProcess-1:3:
CRITICAL 10-08 12:57:00 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 10-08 12:57:00 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(VllmWorker rank=2 pid=10434) Traceback (most recent call last):
Killed

Your error message "MulitprocExecutor got fatal signal from worker processes, shutting down" followed by "Killed" during multi-GPU loading of a quantized model (tensor-parallel-size=4) most likely points to out-of-memory (OOM) or insufficient shared memory for inter-process communication, especially when running inside Docker or with large models. A bare "Killed" with no Python traceback usually means the kernel's OOM killer terminated the process. This commonly happens when the available GPU memory or /dev/shm is too small for the model and the degree of parallelism requested, or when the quantized model and tensor parallelism are not fully compatible. A few host-side checks (sketched below) can help narrow down which resource is being exhausted.
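These are generic Linux/NVIDIA commands, not vLLM-specific; run them on the host (or inside the container) while reproducing the failure:

# Did the kernel OOM killer terminate a worker process?
sudo dmesg -T | grep -i -E "killed process|out of memory"

# Watch per-GPU memory usage while the model loads (refresh every 2 seconds)
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2

# How large is the shared-memory segment used for inter-process communication?
df -h /dev/shm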

Note that vLLM does not support tensor parallelism (TP) with prequantized BitsAndBytes models; pipeline parallelism (PP) must be used instead. If you are not using BitsAndBytes quantization, you may still be running out of GPU memory or shared memory. Try reducing --tensor-parallel-size, lowering --max-model-len, or increasing the shared memory size (e.g., --shm-size=24g in Docker); example invocations are sketched below. For more details, see the error and workaround discussions in vLLM issue #13297 and vLLM issue #10034.
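As a rough sketch of those adjustments (the reduced --max-model-len value, container image tag, and mount paths are illustrative placeholders, not values taken from your setup):

# On the host: keep TP=4 but shrink the context length to reduce KV-cache pressure
vllm serve /home/ubuntu/models/RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --max-model-len 2048 --gpu-memory-utilization 0.85 --dtype auto

# In Docker: give the container a larger shared-memory segment for IPC
# (assumes the official vllm/vllm-openai image; mount path is a placeholder)
docker run --gpus all --shm-size=24g -p 8000:8000 \
  -v /home/ubuntu/models:/models \
  vllm/vllm-openai:latest \
  --model /models/RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 \
  --tensor-parallel-size 4 --max-model-len 2048 --gpu-memory-utilization 0.85

If the crash persists even with a small --max-model-len, the weight footprint itself is likely the problem, and the parallelism and quantization-compatibility notes in the linked issues apply.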

Would you like a step-by-step troubleshooting guide or example commands for your setup?

Hint: Mention @RunLLM in the post for followups.