Environment:
- GPU: NVIDIA GeForce RTX 5090 (Blackwell / sm_120)
- VRAM: 32 GB
- Driver Version: 580.95.05
- CUDA Version: 13.0 (Runtime 12.9)
- Python Version: 3.12
- vLLM Version: 0.11.2
- Model: Qwen3-8B (merged with LoRA) quantized in 8-bit via bitsandbytes.
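One thing worth confirming alongside the versions above is whether the PyTorch build inside the container was compiled for sm_120 at all. The following is a minimal sketch, assuming it is run with the container's Python; `arch_supported` is a hypothetical helper, while `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are the PyTorch calls it relies on. It only checks for an exact `sm_XY` entry, so a build that ships only PTX (`compute_XY`) would be reported as unsupported even though it may still JIT-compile.

```python
def arch_supported(capability, arch_list):
    """Return True if arch_list contains a native sm_XY entry for the device.

    capability: (major, minor) tuple, e.g. (12, 0) for sm_120.
    arch_list:  list of strings such as ["sm_90", "sm_120"].
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def main():
    try:
        import torch  # only needed for the live check, not for the helper
    except ImportError:
        print("torch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device capability:", capability)
    print("compiled arch list:", arch_list)
    print("native kernels for this GPU:", arch_supported(capability, arch_list))


if __name__ == "__main__":
    main()
```

If the last line prints `False`, the crash may come from the build rather than from vLLM's own logic, but that would still need to be confirmed separately.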
I am encountering critical stability issues when running vLLM 0.11.2 on the new RTX 5090 architecture. Under load (a stress test with 30 concurrent users), the server either hangs until I force a reboot (kernel panic) or deadlocks (GPU-Util drops to 0% while VRAM remains full).
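For anyone trying to reproduce this, a stress test of this shape can be sketched with the standard library only. This is a minimal sketch, not the exact tool used: the URL assumes the port from the docker commands in this post and vLLM's OpenAI-compatible `/v1/completions` route, the model name `/model` assumes the served name defaults to the `--model` path, and `build_payload`, `http_send`, and `run_load` are hypothetical helper names.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8001/v1/completions"  # assumption: port 8001 as in the docker command
MODEL = "/model"                              # assumption: served name equals the --model path


def build_payload(prompt, max_tokens=64):
    """Build a completion request body for one prompt."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}


def http_send(prompt, timeout=120):
    """POST one request; return True on HTTP 200 (non-2xx responses raise)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200


def run_load(send, users=30, requests_per_user=5):
    """Run `users` concurrent workers, each sending requests sequentially.

    Returns a tuple (ok_count, failed_count). Exceptions and timeouts
    are counted as failures so a hung server shows up in the totals.
    """
    def one_user(user_id):
        ok = failed = 0
        for i in range(requests_per_user):
            try:
                if send(f"user {user_id} request {i}: say hello"):
                    ok += 1
                else:
                    failed += 1
            except Exception:
                failed += 1
        return ok, failed

    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(one_user, range(users)))
    return sum(r[0] for r in results), sum(r[1] for r in results)


if __name__ == "__main__":
    print(run_load(http_send, users=30, requests_per_user=5))
```

Passing the sender as an argument keeps the concurrency logic separate from the HTTP call, so the same loop can be pointed at a stub when checking the script itself.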
Error 1: Segmentation Fault during initialization
When launching vLLM with default settings (V1 Engine enabled), I get a segfault during the model warmup/CUDA graph capture phase:
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in c10::intrusive_ptr<torch::impl::PythonSymNodeImpl, c10::detail::intrusive_target_default_null_type<torch::impl::PythonSymNodeImpl> >
c10::intrusive_ptr<torch::impl::PythonSymNodeImpl, c10::detail::intrusive_target_default_null_type<torch::impl::PythonSymNodeImpl> >::make<pybind11::object&>(pybind11::object&)
RuntimeError: Engine core initialization failed. Failed core proc(s): {'EngineCore_DP0': -11}
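The `-11` in `Failed core proc(s)` reads like a `multiprocessing` exit code, where a negative value `-N` means the child was terminated by signal `N`. A minimal sketch for decoding it, assuming that convention applies here; `describe_exit_code` is a hypothetical helper name:

```python
import signal


def describe_exit_code(code):
    """Translate a multiprocessing-style exit code into a readable string."""
    if code < 0:
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-11))
```

On Linux, signal 11 is SIGSEGV, which is consistent with the segfault banner printed just before the RuntimeError.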
Error 2: SystemError in Shared Memory Broadcast
If the engine manages to start, it deadlocks after processing ~50 requests with the following error:
(EngineCore_DP0 pid=186) ERROR [core.py:844] EngineCore encountered a fatal error.
SystemError: attempting to create PyCFunction with class but no METH_METHOD flag
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 229, in get_metadata
with self.shared_memory.buf[start:end] as buf:
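The failing line is a plain `memoryview` slice used as a context manager. To check whether that pattern fails in the container's Python 3.12 on its own, independent of vLLM and the GPU, here is a minimal sketch using only the standard library. `roundtrip` is a hypothetical helper and not part of vLLM; it only mirrors the `with ...buf[start:end] as buf:` shape from the traceback.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes, offset: int = 0) -> bytes:
    """Write payload into a shared memory block via a sliced view, then read it back."""
    shm = shared_memory.SharedMemory(create=True, size=offset + len(payload))
    try:
        start, end = offset, offset + len(payload)
        # Same pattern as in shm_broadcast.py: slice the buffer and use
        # the resulting memoryview as a context manager.
        with shm.buf[start:end] as buf:
            buf[:] = payload
        with shm.buf[start:end] as buf:
            return bytes(buf)  # copy out before the view is released
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    for i in range(1000):
        assert roundtrip(b"ping-%d" % i, offset=16) == b"ping-%d" % i
    print("memoryview slice context manager works")
```

If this runs cleanly in the same container, the SystemError would more likely point to interpreter state being corrupted elsewhere in the process than to the slice itself, though that is only a guess.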
- Is there a known incompatibility between sm_120 and the current PythonSymNodeImpl logic in PyTorch/vLLM?
- Does bitsandbytes 8-bit quantization require specific kernels for Blackwell that might be missing in the current vLLM build?
- Why is shm_broadcast.py triggering a SystemError specifically on Python 3.12 with this hardware?

Here is my docker command for vLLM (the model merged with the LoRA adapter, with int8 quantization):
sudo docker run -d \
--name vllm \
--restart always \
--network network \
--runtime nvidia \
--gpus all \
--memory=32g \
--shm-size=16g \
--ipc=host \
-e VLLM_USE_V1=0 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e TOKENIZERS_PARALLELISM=false \
-v /home/qwen3-8b-v4-int8-final_1:/model \
-p 8001:8001 \
vllm/vllm-openai:latest \
--model /model \
--host 0.0.0.0 \
--port 8001 \
--load-format bitsandbytes \
--quantization bitsandbytes \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.80 \
--max-num-seqs 8 \
--enforce-eager \
--disable-custom-all-reduce \
--distributed-executor-backend mp \
--max-num-batched-tokens 4096 \
--disable-frontend-multiprocessing
And here is the docker command for the model merged with the LoRA adapter, without quantization:
sudo docker run -d \
--name vllm \
--restart always \
--network network \
--runtime nvidia \
--gpus all \
--ipc=host \
-v /home/qwen3-8b-v4-merged-final_1:/model \
-p 8001:8001 \
vllm/vllm-openai:latest \
--model /model \
--host 0.0.0.0 \
--port 8001 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 8 \
--enforce-eager
Here are the latest logs (when using the merged model without quantization):
Model loading took 15.2683 GiB memory and 3.691282 seconds
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in c10::intrusive_ptr<torch::impl::PythonSymNodeImpl, c10::detail::intrusive_target_default_null_type<torch::impl::PythonSymNodeImpl> > c10::intrusive_ptr<torch::impl::PythonSymNodeImpl, c10::detail::intrusive_target_default_null_type<torch::impl::PythonSymNodeImpl> >::make<pybind11::object&>(pybind11::object&)
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2024, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2043, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 195, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 236, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/func_utils.py", line 116, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 203, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 133, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 808, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 469, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 907, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 964, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '