Persistent segfaults/SIGSEGV

We have been seeing intermittent SIGSEGV crashes with both a custom model implementation and Qwen3 on vllm-0.10.0rc2. Has anyone figured out the root cause, or is this addressed in later versions?

There are many related GitHub issues, none of which seem to be resolved:

Trace
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,021:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{B}'], Pred: ['']
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,022:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{B. 8}'], Pred: ['']
(RewardLoopWorker pid=3555498) WARNING:2026-01-19 22:51:41,024:We did not manage to extract a prediction in the correct format. Gold: ['\\boxed{8}'], Pred: ['']
(WorkerDict pid=3546913) *** SIGSEGV received at time=1768863357 on cpu 79 ***
(WorkerDict pid=3546913) PC: @                0x4  (unknown)  (unknown)
(WorkerDict pid=3546913)     @     0x7ffff7cd2520       4000  (unknown)
(WorkerDict pid=3546913)     @     0x7fce4b9d3155        352  (unknown)
(WorkerDict pid=3546913)     @     0x7fce4b9dde19       1344  (unknown)
(WorkerDict pid=3546913)     @     0x7fce4bac1c6c        224  (unknown)
(WorkerDict pid=3546913)     @     0x7fffdb82094b         64  (unknown)
(WorkerDict pid=3546913)     @     0x7fffdb87c01b        224  cudaGraphLaunch
(WorkerDict pid=3546913)     @     0x7fceef0ca81a  (unknown)  at::cuda::CUDAGraph::replay()
(WorkerDict pid=3546913)     @     0x7fbb7a668130  (unknown)  (unknown)
(WorkerDict pid=3546913)     @                0x4  (unknown)  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,644 E 3546913 3548010] logging.cc:474: *** SIGSEGV received at time=1768863357 on cpu 79 ***
(WorkerDict pid=3546913) [2026-01-19 22:55:57,651 E 3546913 3548010] logging.cc:474: PC: @                0x4  (unknown)  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7ffff7cd2520       4000  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fce4b9d3155        352  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fce4b9dde19       1344  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fce4bac1c6c        224  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fffdb82094b         64  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fffdb87c01b        224  cudaGraphLaunch
(WorkerDict pid=3546913) [2026-01-19 22:55:57,655 E 3546913 3548010] logging.cc:474:     @     0x7fceef0ca81a  (unknown)  at::cuda::CUDAGraph::replay()
(WorkerDict pid=3546913) [2026-01-19 22:55:57,659 E 3546913 3548010] logging.cc:474:     @     0x7fbb7a668130  (unknown)  (unknown)
(WorkerDict pid=3546913) [2026-01-19 22:55:57,666 E 3546913 3548010] logging.cc:474:     @                0x4  (unknown)  (unknown)
(WorkerDict pid=3546913) Fatal Python error: Segmentation fault
(WorkerDict pid=3546913) 
(WorkerDict pid=3546913) Stack (most recent call first):
(WorkerDict pid=3546913)   File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 117 in replay
(WorkerDict pid=3546913)   File "/workspace/Zvllm/vllm/compilation/cuda_graph.py", line 198 in __call__
(WorkerDict pid=3546913)   File "/workspace/Zvllm/vllm/v1/worker/gpu_model_runner.py", line 2298 in execute_model
(WorkerDict pid=3546913)   File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120 in decorate_context
(WorkerDict pid=3546913)   File "/workspace/Zvllm/vllm/v1/worker/gpu_worker.py", line 447 in execute_model
(WorkerDict pid=3546913)   File "/workspace/verl_trainer/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120 in decorate_context
(WorkerDict pid=3546913)   File "/workspace/Zvllm/vllm/utils/__init__.py", line 3122 in run_method
(WorkerDict pid=3546913)   File "/workspace/Zvllm/vllm/worker/worker_base.py", line 267 in execute_method
(WorkerDict pid=3546913)   File "/tmp/ray/session_2026-01-19_10-36-59_738292_3529560/runtime_resources/working_dir_files/_ray_pkg_385e62c0114024ab/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 224 in _execute_method
(WorkerDict pid=3546913)   File "/tmp/ray/session_2026-01-19_10-36-59_738292_3529560/runtime_resources/working_dir_files/_ray_pkg_385e62c0114024ab/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 177 in _loop_forever
(WorkerDict pid=3546913)   File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012 in run
(WorkerDict pid=3546913)   File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
(WorkerDict pid=3546913)   File "/workspace/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1032 in _bootstrap
(WorkerDict pid=3546913) 
(WorkerDict pid=3546913) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, google._upb._message, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, regex._regex, markupsafe._speedups, PIL._imaging, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, _cyutility, scipy._cyutility, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, 
scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation, scipy.spatial.transform._rigid_transform, scipy.optimize._direct, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, _cffi_backend, _cbor2, setproctitle._setproctitle, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, 
pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, nvtx._lib.lib, nvtx._lib.profiler, sentencepiece._sentencepiece, msgspec._core, pybase64._pybase64, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, vllm.cumem_allocator, cuda_utils, __triton_launcher (total: 221)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 06000000610da2e5ba04608c250235880f429b08ae2b17471bd777123ce43c68 Worker ID: 72612bb7364a1087afc287488e22d902eb4bcaaaadb5630f2296d358 Node ID: 75d0cfca4322a528d3ea520a8419dd8f8b02f7f42571bda6252c0bc9 Worker IP address: 10.28.5.11 Worker port: 10199 Worker PID: 3546913 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.

I’ve seen the same trace with both the triton and flashattn attention backends. It is intermittent, sometimes only showing up 5 hours into an eval.
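Incidentally, the "Stack (most recent call first)" section in the trace above comes from Python's faulthandler module, which Ray enables in its workers. If you want the same crash-time Python stacks from your own processes (or want them redirected to a per-process file), you can enable it explicitly. A minimal sketch; the temp-file destination is just a stand-in for a real per-worker crash log:

```python
import faulthandler
import tempfile

# Illustrative destination: a temp file standing in for a per-worker crash log.
log = tempfile.TemporaryFile(mode="w+")

# On SIGSEGV (and SIGFPE/SIGABRT/SIGBUS), dump every thread's Python stack here.
faulthandler.enable(file=log, all_threads=True)

# For demonstration, dump the current stacks without actually crashing:
faulthandler.dump_traceback(file=log, all_threads=True)

log.seek(0)
print(log.read())  # includes a "(most recent call first)" stack per thread
```

Note that faulthandler only recovers the Python-level stack; for the native frames (cudaGraphLaunch, CUDAGraph::replay) you still need a core dump or gdb.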

Intermittent SIGSEGV (segmentation fault) errors with vLLM, especially in distributed/Ray setups and with models like Qwen3, are a known issue and have been reported multiple times. The reported root causes vary: NCCL communication errors, CUDA graph replay, and network or memory misconfiguration, with no single universally confirmed cause as of the latest reports. Upgrading vLLM may help, but even recent versions and different attention backends (triton, flashattn, xformers) have shown similar instability in some environments. Known workarounds include passing --enforce-eager to disable CUDA graphs (at a performance cost), tuning memory parameters, and setting the correct NCCL/GLOO network interfaces, but none of these is guaranteed to fix the problem for all users or models. See the following for details and user experiences: ray cluster Segmentation fault, SIGSEGV received at time=1721904360.

If you want to debug further, enable more logging (VLLM_LOGGING_LEVEL=DEBUG, NCCL_DEBUG=TRACE) and check for OOM or network issues. Since your trace shows the error occurring during CUDA graph replay, try --enforce-eager to isolate the problem. For distributed setups, ensure NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME point to the correct network interface. There is no evidence in the data sources that a specific vLLM release fully resolves these SIGSEGVs for all users and models. Would you like a step-by-step troubleshooting checklist or more detail on a specific workaround?
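For reference, a typical debug environment setup for the next run might look like this (eth0 is a placeholder; substitute your actual inter-node interface, e.g. from `ip addr`):

```shell
export VLLM_LOGGING_LEVEL=DEBUG    # verbose vLLM engine logging
export NCCL_DEBUG=TRACE            # full NCCL communication tracing
export NCCL_SOCKET_IFNAME=eth0     # placeholder: your inter-node interface
export GLOO_SOCKET_IFNAME=eth0     # placeholder: same interface, for Gloo/Ray
```

NCCL_DEBUG=TRACE is very verbose; drop back to INFO once you have captured one crash.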

