I am running the nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 model (downloaded from ModelScope) on a server with two H100 PCIe cards connected by an NVLink bridge. The software environment is as follows:
Ubuntu 25.10
Nvidia Driver Version: 580.126.20
CUDA 12.8
conda 26.1.1
python3.12
vllm 0.19.1 (torch 2.10.0+cu128, installed via uv pip install vllm --torch-backend=cu128)
Hermes client: Hermes Agent v0.10.0 (2026.4.16) / Python 3.11.15 / OpenAI SDK 2.32.0
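For completeness, the versions above were checked roughly like this inside the serving environment (the conda env name comes from the paths in the log below; adjust to your setup):

nvidia-smi                      # driver 580.126.20, both H100s visible
nvidia-smi topo -m              # confirms the NVLink bridge between the two cards
conda activate vllm_new
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # 2.10.0+cu128 / 12.8
python -c "import vllm; print(vllm.__version__)"                         # 0.19.1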
The running parameters are as follows:
vllm serve /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --api-key "********" \
  --served-model-name "nv/Nemo3-120B" \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/chat_template.jinja
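Requests come from the Hermes agent over the OpenAI-compatible API. For anyone trying to reproduce, a minimal streaming request against this endpoint looks roughly like the following (the prompt is only a placeholder; the model name and key match the serve flags above):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ********" \
  -d '{"model": "nv/Nemo3-120B", "stream": true, "messages": [{"role": "user", "content": "hello"}]}'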
After prolonged use (several hours of serving), the following error appears; it has occurred 3-4 times over the past two days:
(APIServer pid=2816087) INFO 04-20 21:42:57 [loggers.py:259] Engine 000: Avg prompt throughput: 14460.1 tokens/s, Avg generation throughput: 27.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
[rank1]:[E420 21:43:00.657365322 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for cudaErrorIllegalInstruction in CUDA Runtime API :: CUDA Toolkit Documentation for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x75d59baf20e0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x75d47cf428c0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x75d47cf4fa38 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x75d47cf53509 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x75d47cf555a5 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal instruction was encountered
Search for cudaErrorIllegalInstruction in CUDA Runtime API :: CUDA Toolkit Documentation for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x75d59baf20e0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x75d47cf428c0 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x75d47cf4fa38 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x75d47cf53509 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x75d47cf555a5 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x75d562172fdd in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x9bbbc8 (0x75d47c7bbbc8 in /home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xf2584 (0x75d55bcf2584 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0xa3d64 (0x75d59cca3d64 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x1373fc (0x75d59cd373fc in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore pid=2816377) ERROR 04-20 21:43:01 [multiproc_executor.py:273] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(Worker_TP0 pid=2816594) INFO 04-20 21:43:01 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(Worker_TP0 pid=2816594) INFO 04-20 21:43:01 [multiproc_executor.py:859] WorkerProc shutting down.
(APIServer pid=2816087) INFO 04-20 21:43:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1) with config: model='/home/lgdu/.cache/modelscope/hub/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8', speculative_config=None, tokenizer='/home/lgdu/.cache/modelscope/hub/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=modelopt, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nv/Nemo3-120B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_co… [truncated]
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=, scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-8de7eb290988c356-b2dffbf2'],resumed_req_ids=set(),new_token_ids_lens=,all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[72820],num_output_tokens=[358]), num_scheduled_tokens={chatcmpl-8de7eb290988c356-b2dffbf2: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0, 0], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.024778761061946875, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1101, in run_engine_core
engine_core.run_busy_loop()
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1142, in run_busy_loop
self._process_engine_step()
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1181, in _process_engine_step
outputs, model_executed = self.step_fn()
^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 84, in result
return super().result()
^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
with self.acquire_read(timeout, indefinite) as buf:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 677, in acquire_read
raise RuntimeError("cancelled")
(EngineCore pid=2816377) ERROR 04-20 21:43:09 [core.py:1110] RuntimeError: cancelled
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
outputs = await engine_core.get_output_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 985, in get_output_async
raise self._format_exception(outputs) from None
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Error in chat completion stream generator.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 582, in generate
out = q.get_nowait() or await q.get()
^^^^^^^^^^^^^
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
raise output
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2816087) ERROR 04-20 21:43:09 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Error in chat completion stream generator.
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] Traceback (most recent call last):
File "/home/lgdu/miniconda3/envs/vllm_new/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 985, in get_output_async
raise self._format_exception(outputs) from None
(APIServer pid=2816087) ERROR 04-20 21:43:09 [serving.py:1261] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2816087) INFO: 127.0.0.1:42132 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=2816087) INFO: Shutting down
(APIServer pid=2816087) INFO: Waiting for application shutdown.
(APIServer pid=2816087) INFO: Application shutdown complete.
(APIServer pid=2816087) INFO: Finished server process [2816087]
As the log notes, the CUDA error is reported asynchronously, so the crash seems to originate in an asynchronously launched kernel. After setting CUDA_LAUNCH_BLOCKING=1, the error has not recurred, but throughput dropped significantly. Is there another way to avoid this error while keeping the original speed?
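For reference, the slow-but-stable run is just the same launch with the variable exported first (a sketch; all remaining flags are unchanged from the command at the top of this post):

export CUDA_LAUNCH_BLOCKING=1   # forces synchronous kernel launches: the illegal-instruction error surfaces at the faulting kernel, but throughput drops
vllm serve /path/to/models/nv-community/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 --tensor-parallel-size 2 --disable-custom-all-reduce   # plus the other flags listed above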