We have been seeing intermittent SIGSEGV crashes with both a custom model implementation and Qwen3 on vllm-0.10.0rc2. Has anyone figured out the root cause, or is this addressed in later versions?
There are many related GitHub issues, none of which appear to be resolved.
Trace
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,021:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{B}'], Pred: ['']
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,022:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{B. 8}'], Pred: ['']
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,024:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{8}'], Pred: ['']
(WorkerDict pid=3546913) *** SIGSEGV received at time=1768863357 on cpu 79 ***
(WorkerDict pid=3546913) PC: @ 0x4 (unknown) (unknown)
(WorkerDict pid=3546913) @ 0x7ffff7cd2520 4000 (unknown)
(WorkerDict pid=3546913) @ 0x7fce4b9d3155 352 (unknown)
(WorkerDict pid=3546913) @ 0x7fce4b9dde19 1344 (unknown)
(WorkerDict pid=3546913) @ 0x7fce4bac1c6c 224 (unknown)
(WorkerDict pid=3546913) @ 0x7fffdb82094b 64 (unknown)
(WorkerDict pid=3546913) @ 0x7fffdb87c01b 224 cudaGraphLaunch
(WorkerDict pid=3546913) @ 0x7fceef0ca81a (unknown) at::cuda::CUDAGraph::replay()
(WorkerDict pid=3546913) @ 0x7fbb7a668130 (unknown) (unknown)
(WorkerDict pid=3546913) @ 0x4 (unknown) (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,644 E 3546913 3548010] logging.cc:474: *** SIGSEGV received at time=1768863357 on cpu 79 ***
(WorkerDict pid=3546913) [2026-01-19 22:55:57,651 E 3546913 3548010] logging.cc:474: PC: @ 0x4 (unknown) (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7ffff7cd2520 4000 (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fce4b9d3155 352 (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fce4b9dde19 1344 (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fce4bac1c6c 224 (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fffdb82094b 64 (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fffdb87c01b 224 cudaGraphLaunch
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fceef0ca81a (unknown) at::cuda::CUDAGraph::replay()
(WorkerDict pid=3546913) [2026-01-19 22:55:57,659 E 3546913 3548010] logging.cc:474: @ 0x7fbb7a668130 (unknown) (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,666 E 3546913 3548010] logging.cc:474: @ 0x4 (unknown) (unknown)
(WorkerDict pid=3546913) Fatal Python error: Segmentation fault
(WorkerDict pid=3546913)
(WorkerDict pid=3546913) Stack (most recent call first):
(WorkerDict pid=3546913) File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 117 in replay
(WorkerDict pid=3546913) File "/workspace/Zvllm/vllm/compilation/cuda_graph.py", line 198 in __call__
(WorkerDict pid=3546913) File "/workspace/Zvllm/vllm/v1/worker/gpu_model_runner.py", line 2298 in execute_model
(WorkerDict pid=3546913) File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120 in decorate_context
(WorkerDict pid=3546913) File "/workspace/Zvllm/vllm/v1/worker/gpu_worker.py", line 447 in execute_model
(WorkerDict pid=3546913) File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120 in decorate_context
(WorkerDict pid=3546913) File "/workspace/Zvllm/vllm/utils/__init__.py", line 3122 in run_method
(WorkerDict pid=3546913) File "/workspace/Zvllm/vllm/worker/worker_base.py", line 267 in execute_method
(WorkerDict pid=3546913) File "/tmp/ray/session_2026-01-19_10-36-59_738292_3529560/runtime_resources/working_dir_files/_ray_pkg_385e62c0114024ab/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 224 in _execute_method
(WorkerDict pid=3546913) File "/tmp/ray/session_2026-01-19_10-36-59_738292_3529560/runtime_resources/working_dir_files/_ray_pkg_385e62c0114024ab/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 177 in _loop_forever
(WorkerDict pid=3546913) File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012 in run
(WorkerDict pid=3546913) File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
(WorkerDict pid=3546913) File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1032 in _bootstrap
(WorkerDict pid=3546913)
(WorkerDict pid=3546913) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, google._upb._message, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, regex._regex, markupsafe._speedups, PIL._imaging, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, _cyutility, scipy._cyutility, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, 
scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation, scipy.spatial.transform._rigid_transform, scipy.optimize._direct, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, _cffi_backend, _cbor2, setproctitle._setproctitle, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, 
pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, nvtx._lib.lib, nvtx._lib.profiler, sentencepiece._sentencepiece, msgspec._core, pybase64._pybase64, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, vllm.cumem_allocator, cuda_utils, __triton_launcher (total: 221)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 06000000610da2e5ba04608c250235880f429b08ae2b17471bd777123ce43c68 Worker ID: 72612bb7364a1087afc287488e22d902eb4bcaaaadb5630f2296d358 Node ID: 75d0cfca4322a528d3ea520a8419dd8f8b02f7f42571bda6252c0bc9 Worker IP address: 10.28.5.11 Worker port: 10199 Worker PID: 3546913 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
I’ve seen the same trace with both the Triton and FlashAttention attention backends. The crash is intermittent and sometimes only shows up 5 hours into an eval.
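In case it helps anyone compare their logs against this one, here is a minimal sketch for pulling the symbolized native frames out of a Ray worker log. The regex and the `symbolized_frames` helper are my own assumptions, written only against the line format in the trace above, and are not part of vLLM, Ray, or verl. The sketch skips the `logging.cc` copies of each frame so they are not counted twice.

```python
import re

# Matches raw native frames such as:
#   (WorkerDict pid=3546913) @ 0x7fffdb87c01b 224 cudaGraphLaunch
# The pattern is an assumption based only on the lines in this trace.
# Lines repeated through logging.cc carry a "[timestamp ...]" prefix after
# the pid and therefore do not match, which avoids double counting.
FRAME_RE = re.compile(
    r"\(WorkerDict pid=(?P<pid>\d+)\)\s+(?:PC:\s+)?@\s+(?P<addr>0x[0-9a-f]+)\s+"
    r"(?P<size>\d+|\(unknown\))\s+(?P<symbol>.+)$"
)


def symbolized_frames(log_text):
    """Return (pid, address, symbol) for frames whose symbol is known."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        symbol = match.group("symbol").strip()
        if symbol != "(unknown)":
            frames.append((int(match.group("pid")), match.group("addr"), symbol))
    return frames


# A few lines copied from the trace above.
sample = """\
(WorkerDict pid=3546913) PC: @ 0x4 (unknown) (unknown)
(WorkerDict pid=3546913) @ 0x7fffdb82094b 64 (unknown)
(WorkerDict pid=3546913) @ 0x7fffdb87c01b 224 cudaGraphLaunch
(WorkerDict pid=3546913) @ 0x7fceef0ca81a (unknown) at::cuda::CUDAGraph::replay()
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474: @ 0x7fffdb87c01b 224 cudaGraphLaunch
"""

for pid, addr, symbol in symbolized_frames(sample):
    print(pid, addr, symbol)
```

On the lines above, only two frames come back with names, `cudaGraphLaunch` and `at::cuda::CUDAGraph::replay()`. That matches the Python stack, which ends in `replay` in `torch/cuda/graphs.py`.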