EngineCore Error with NVIDIA-Nemotron-3-Super-120B-A12B-FP8 on 2*H100

I am running the nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 model (downloaded from ModelScope) on a server with two H100 PCIe cards connected by an NVLink bridge. The software environment is as follows:
Ubuntu 25.10
Nvidia Driver Version: 580.126.20
CUDA 12.8
conda 26.1.1
python3.12
vllm 0.19.1 (torch 2.10.0+cu128, installed via uv pip install vllm --torch-backend=cu128)
Hermes Agent v0.10.0 (2026.4.16), Python 3.11.15, OpenAI SDK 2.32.0
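
For reference, the versions above can be confirmed from inside the conda environment with a quick probe (the environment name vllm_new is taken from the paths in the logs below; adjust as needed):

# Confirm the torch/CUDA versions and that both GPUs are visible
conda activate vllm_new
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"
# Confirm the driver version reported above
nvidia-smi --query-gpu=name,driver_version --format=csv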

The running parameters are as follows:
vllm serve /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --api-key "********" \
  --served-model-name "nv/Nemo3-120B" \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/chat_template.jinja
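
Once the server is up, I sanity-check it with a quick smoke test against the OpenAI-compatible endpoint (a minimal sketch, assuming the server is reached at localhost:8000 with the masked API key above):

# List the served models (should show nv/Nemo3-120B)
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer ********"

# Minimal chat completion round trip
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer ********" \
  -H "Content-Type: application/json" \
  -d '{"model": "nv/Nemo3-120B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'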

After prolonged use (several hours), the following error has appeared 3-4 times over the past two days:

(APIServer pid=2816087) INFO 04-20 21:42:57 [loggers.py:259] Engine 000: Avg prompt throughput: 14460.1 tokens/s, Avg generation throughput: 27.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
[rank1]:[E420 21:43:00.657365322 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for cudaErrorIllegalInstruction in CUDA Runtime API :: CUDA Toolkit Documentation for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x75d59baf20e0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x75d47cf428c0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x75d47cf4fa38 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x75d47cf53509 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x75d47cf555a5 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for cudaErrorIllegalInstruction in CUDA Runtime API :: CUDA Toolkit Documentation for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x75d59baf20e0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x75d47cf428c0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x75d47cf4fa38 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x75d47cf53509 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x75d47cf555a5 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x9bbbc8 (0x75d47c7bbbc8 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)

(EngineCore pid=2816377) ERROR 04-20 21:43:01 [multiproc_executor.py:273] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(Worker_TP0 pid=2816594) INFO 04-20 21:43:01 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(Worker_TP0 pid=2816594) INFO 04-20 21:43:01 [multiproc_executor.py:859] WorkerProc shutting down.
(APIServer pid=2816087) INFO 04-20 21:43:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1) with config: model='/home/lgdu/.cache/modelscope/hub/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8', speculative_config=None, tokenizer='/home/lgdu/.cache/modelscope/hub/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=modelopt, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nv/Nemo3-120B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_co… [truncated]
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=, scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-8de7eb290988c356-b2dffbf2'],resumed_req_ids=set(),new_token_ids_lens=,all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[72820],num_output_tokens=[358]), num_scheduled_tokens={chatcmpl-8de7eb290988c356-b2dffbf2: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0, 0], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.024778761061946875, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1101, in run_engine_core
engine_core.run_busy_loop()
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1142, in run_busy_loop
self._process_engine_step()
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1181, in _process_engine_step
outputs, model_executed = self.step_fn()
^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 84, in result
return super().result()
^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
with self.acquire_read(timeout, indefinite) as buf:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 677, in acquire_read
raise RuntimeError("cancelled")
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] RuntimeError: cancelled
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
outputs = await engine_core.get_output_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 985, in get_output_async
raise self._format_exception(outputs) from None
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Error in chat completion stream generator.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 582, in generate
out = q.get_nowait() or await q.get()
^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
raise output
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Error in chat completion stream generator.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 985, in get_output_async
raise self._format_exception(outputs) from None
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2816087) INFO: 127.0.0.1:42132 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=2816087) INFO: Shutting down
(APIServer pid=2816087) INFO: Waiting for application shutdown.
(APIServer pid=2816087) INFO: Application shutdown complete.
(APIServer pid=2816087) INFO: Finished server process [2816087]

It appears the error occurs on an asynchronous code path. After setting CUDA_LAUNCH_BLOCKING=1, the error has not recurred, but inference became significantly slower. Are there other ways to avoid this error while maintaining high speed?

Your error (CUDA error: an illegal instruction was encountered) is a known issue when running Nemotron-3 Super 120B FP8 on H100s with vLLM, especially with tensor parallelism and async execution. Setting CUDA_LAUNCH_BLOCKING=1 avoids the crash but significantly reduces throughput, as you observed. This is due to a bug in vLLM’s Triton-based Mamba ops on Blackwell/Hopper GPUs, which only manifests in async mode; the root cause is likely a missing memory barrier or incorrect PTX target in the Triton kernel JIT output for these GPUs. There is currently no official fix, and the only reliable workaround is to use CUDA_LAUNCH_BLOCKING=1 (with or without --enforce-eager), which unfortunately slows down inference by ~37% or more and limits GPU utilization (see details).
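
For example, the blocking-mode launch is just your original command with the environment variable set; this is only a sketch of the workaround, not a fix, and "..." stands for the rest of your flags unchanged:

# CUDA_LAUNCH_BLOCKING=1 serializes kernel launches, so the faulting kernel is
# reported synchronously; this is what trades throughput for stability.
CUDA_LAUNCH_BLOCKING=1 vllm serve /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tensor-parallel-size 2 --disable-custom-all-reduce ...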

Alternative workarounds, such as using the native causal-conv1d/mamba-ssm CUDA kernels or switching to a different attention backend, are not yet available in vLLM for this model. Some users have reported partial success with --moe-backend marlin, but this may not apply to your model and setup. There is no parameter or environment variable in vLLM 0.19.1 that can fully resolve this issue while maintaining high speed, other than the blocking workaround. For more information and updates, monitor this issue.
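
Until an upstream fix lands, one generic operational stopgap (a suggestion, not a vLLM feature) is to supervise the server and restart it when the engine dies, since the API server shuts itself down after EngineDeadError:

# Naive supervisor loop; a systemd unit with Restart=on-failure achieves the same.
while true; do
  vllm serve /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 ...  # same flags as above
  echo "vllm exited with status $?; restarting in 10s" >&2
  sleep 10
done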

Would you like a step-by-step breakdown of the technical details or possible experimental debugging steps?


Thank you, I will continue to follow these two issues.