Intermittent Service Downtime Issue with Magistral-Small-2506 Model on GPU VM

Hi,

We have deployed the “Magistral-Small-2506” model on a GPU-equipped virtual machine. Our environment details are as follows:

  • OS version: Ubuntu 22.04.5

  • GPU model: NVIDIA A100

  • vLLM version: 0.10.0

Recently, we noticed that the model service suddenly went down. We restarted it and it resumed normal operation, but after running for a while the problem recurred. We have tried to identify the root cause by examining the model logs, but we still have no idea what might be causing the issue.

We start the model using the following command:

CUDA_LAUNCH_BLOCKING=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 vllm serve /models/Magistral-Small-2506 --served-model-name Magistral-Small-2506 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2 --gpu-memory-utilization 0.9 --port 7999 --max-model-len 131072 >> /var/log/vllm/vllm-Magistral-Small-2506.log &

Do you have any idea what might be causing this? Additionally, I have enclosed the error log for your reference.

Thanks.

Jimmy

=================================================================

INFO 08-14 05:20:02 [async_llm.py:269] Added request chatcmpl-a46a5300346a43ed9d60e604c2683451.
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] WorkerProc hit an exception.
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] Traceback (most recent call last):
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py”, line 541, in worker_busy_loop
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] output = func(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/utils/contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return func(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py”, line 337, in execute_model
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/utils/contextlib.py”, line 116, in decorate_context
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return func(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 1450, in execute_model
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] model_output = self.model(
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1751, in wrapped_call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return self.call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1762, in call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py”, line 584, in forward
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] model_output = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/compilation/decorators.py”, line 279, in call
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py”, line 368, in forward
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] def forward(
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1751, in wrapped_call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return self.call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1762, in call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/dynamo/eval_frame.py”, line 838, in fn
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return fn(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/fx/graph_module.py”, line 830, in call_wrapped
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return self.wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/fx/graph_module.py”, line 406, in call
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] raise e
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/fx/graph_module.py”, line 393, in call
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1751, in wrapped_call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return self.call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1762, in call_impl
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “<eval_with_key>.82”, line 306, in forward
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] submod_16 = self.submod_16(getitem_38, s0, l_self_modules_layers_modules_7_modules_self_attn_modules_o_proj_parameters_weight
, getitem_39, l_self_modules_layers_modules_7_modules_post_attention_layernorm_parameters_weight, l_self_modules_layers_modules_7_modules_mlp_modules_gate_up_proj_parameters_weight, l_self_modules_layers_modules_7_modules_mlp_modules_down_proj_parameters_weight, l_self_modules_layers_modules_8_modules_input_layernorm_parameters_weight, l_self_modules_layers_modules_8_modules_self_attn_modules_qkv_proj_parameters_weight, l_positions, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache); getitem_38 = l_self_modules_layers_modules_7_modules_self_attn_modules_o_proj_parameters_weight = getitem_39 = l_self_modules_layers_modules_7_modules_post_attention_layernorm_parameters_weight = l_self_modules_layers_modules_7_modules_mlp_modules_gate_up_proj_parameters_weight = l_self_modules_layers_modules_7_modules_mlp_modules_down_proj_parameters_weight = l_self_modules_layers_modules_8_modules_input_layernorm_parameters_weight = l_self_modules_layers_modules_8_modules_self_attn_modules_qkv_proj_parameters_weight = None
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/compilation/cuda_piecewise_backend.py”, line 217, in call
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] entry.cudagraph.replay()
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] File “/home/llmsvc1/venv/lib/python3.10/site-packages/torch/cuda/graphs.py”, line 88, in replay
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] super().replay()
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] RuntimeError: CUDA error: unrecognized error code
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorker rank=1 pid=1420673) ERROR 08-14 05:20:02 [multiproc_executor.py:546]
INFO 08-14 05:21:32 [async_llm.py:428] Aborted request chatcmpl-a46a5300346a43ed9d60e604c2683451.
INFO 08-14 05:21:32 [async_llm.py:336] Request chatcmpl-a46a5300346a43ed9d60e604c2683451 aborted.
INFO 08-14 05:21:32 [logger.py:41] Received request chatcmpl-bcab15f76c9344f482e83c37ffa7a1cc: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[‘<|im_end|>’], stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 3, 10114, 1576, 1073, 1855, 23020, 1321, 47372, 1039, 1693, 1636, 1584, 23020, 4], prompt_embeds shape: None, lora_request: None.
INFO 08-14 05:21:32 [async_llm.py:269] Added request chatcmpl-bcab15f76c9344f482e83c37ffa7a1cc.
INFO 08-14 05:23:02 [async_llm.py:428] Aborted request chatcmpl-bcab15f76c9344f482e83c37ffa7a1cc.
INFO 08-14 05:23:02 [async_llm.py:336] Request chatcmpl-bcab15f76c9344f482e83c37ffa7a1cc aborted.
INFO 08-14 05:23:03 [logger.py:41] Received request chatcmpl-b1b1086e0c5f492baaf149cda029daad: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[‘<|im_end|>’], stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 3, 10114, 1576, 1073, 1855, 23020, 1321, 47372, 1039, 1693, 1636, 1584, 23020, 4], prompt_embeds shape: None, lora_request: None.
INFO 08-14 05:23:03 [async_llm.py:269] Added request chatcmpl-b1b1086e0c5f492baaf149cda029daad.
INFO 08-14 05:24:33 [async_llm.py:428] Aborted request chatcmpl-b1b1086e0c5f492baaf149cda029daad.
INFO 08-14 05:24:33 [async_llm.py:336] Request chatcmpl-b1b1086e0c5f492baaf149cda029daad aborted.
ERROR 08-14 05:25:02 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.0) with config: model=‘/models/Magistral-Small-2506’, speculative_config=None, tokenizer=‘/models/Magistral-Small-2506’, skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=mistral, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Magistral-Small-2506, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:”“,“cache_dir”:”“,“backend”:”“,“custom_ops”:,“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”,“vllm.mamba_mixer2”],“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:512,“local_cache_dir”:null},
ERROR 08-14 05:25:02 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-a46a5300346a43ed9d60e604c2683451,prompt_token_ids_len=15,mm_inputs=,mm_hashes=,mm_positions=,sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[‘<|im_end|>’], stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([3199],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=, resumed_from_preemption=, new_token_ids=, new_block_ids=, num_computed_tokens=), num_scheduled_tokens={chatcmpl-a46a5300346a43ed9d60e604c2683451: 15}, total_num_scheduled_tokens=15, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=, free_encoder_input_ids=, structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 08-14 05:25:02 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, kv_cache_usage=5.1201966155489664e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=15, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
ERROR 08-14 05:25:02 [core.py:634] EngineCore encountered a fatal error.
ERROR 08-14 05:25:02 [core.py:634] Traceback (most recent call last):
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py”, line 237, in collective_rpc
ERROR 08-14 05:25:02 [core.py:634] result = get_response(w, dequeue_timeout)
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py”, line 220, in get_response
ERROR 08-14 05:25:02 [core.py:634] status, result = w.worker_response_mq.dequeue(
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/distributed/device_communicators/shm_broadcast.py”, line 507, in dequeue
ERROR 08-14 05:25:02 [core.py:634] with self.acquire_read(timeout, cancel) as buf:
ERROR 08-14 05:25:02 [core.py:634] File “/usr/lib/python3.10/contextlib.py”, line 135, in enter
ERROR 08-14 05:25:02 [core.py:634] return next(self.gen)
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/distributed/device_communicators/shm_broadcast.py”, line 469, in acquire_read
ERROR 08-14 05:25:02 [core.py:634] raise TimeoutError
ERROR 08-14 05:25:02 [core.py:634] TimeoutError
ERROR 08-14 05:25:02 [core.py:634]
ERROR 08-14 05:25:02 [core.py:634] The above exception was the direct cause of the following exception:
ERROR 08-14 05:25:02 [core.py:634]
ERROR 08-14 05:25:02 [core.py:634] Traceback (most recent call last):
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 625, in run_engine_core
ERROR 08-14 05:25:02 [core.py:634] engine_core.run_busy_loop()
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 652, in run_busy_loop
ERROR 08-14 05:25:02 [core.py:634] self._process_engine_step()
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 677, in _process_engine_step
ERROR 08-14 05:25:02 [core.py:634] outputs, model_executed = self.step_fn()
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 267, in step
ERROR 08-14 05:25:02 [core.py:634] model_output = self.execute_model_with_error_logging(
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 253, in execute_model_with_error_logging
ERROR 08-14 05:25:02 [core.py:634] raise err
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 244, in execute_model_with_error_logging
ERROR 08-14 05:25:02 [core.py:634] return model_fn(scheduler_output)
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py”, line 167, in execute_model
ERROR 08-14 05:25:02 [core.py:634] (output, ) = self.collective_rpc(
ERROR 08-14 05:25:02 [core.py:634] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py”, line 243, in collective_rpc
ERROR 08-14 05:25:02 [core.py:634] raise TimeoutError(f"RPC call to {method} timed out.”) from e
ERROR 08-14 05:25:02 [core.py:634] TimeoutError: RPC call to execute_model timed out.
ERROR 08-14 05:25:02 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 08-14 05:25:02 [async_llm.py:416] Traceback (most recent call last):
ERROR 08-14 05:25:02 [async_llm.py:416] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py”, line 375, in output_handler
ERROR 08-14 05:25:02 [async_llm.py:416] outputs = await engine_core.get_output_async()
ERROR 08-14 05:25:02 [async_llm.py:416] File “/home/llmsvc1/venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py”, line 751, in get_output_async
ERROR 08-14 05:25:02 [async_llm.py:416] raise self._format_exception(outputs) from None
ERROR 08-14 05:25:02 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:594 ‘unrecognized error code’
INFO 08-14 05:32:10 [init.py:235] Automatically detected platform cuda.
INFO 08-14 05:32:13 [api_server.py:1755] vLLM API server version 0.10.0
INFO 08-14 05:32:13 [cli_args.py:261] non-default args: {‘model_tag’: ‘/models/Magistral-Small-2506’, ‘port’: 7999, ‘enable_auto_tool_choice’: True, ‘tool_call_parser’: ‘mistral’, ‘model’: ‘/models/Magistral-Small-2506’, ‘tokenizer_mode’: ‘mistral’, ‘max_model_len’: 131072, ‘served_model_name’: [‘Magistral-Small-2506’], ‘config_format’: ‘mistral’, ‘load_format’: ‘mistral’, ‘tensor_parallel_size’: 2}
ERROR 08-14 05:32:18 [config.py:133] Error retrieving safetensors: Repo id must be in the form ‘repo_name’ or ‘namespace/repo_name’: ‘/models/Magistral-Small-2506’. Use repo_type argument if needed., retrying 1 of 2
ERROR 08-14 05:32:20 [config.py:131] Error retrieving safetensors: Repo id must be in the form ‘repo_name’ or ‘namespace/repo_name’: ‘/models/Magistral-Small-2506’. Use repo_type argument if needed.

Your error log shows a fatal CUDA error (“unrecognized error code”) during CUDA graph replay on worker rank 1, followed by an RPC timeout in vLLM’s multiprocessing executor once the engine can no longer reach that worker. This pattern usually indicates a low-level CUDA kernel crash, which can be triggered by running out of GPU memory, an invalid memory access, or a bug in the model code or CUDA backend. The failed custom all-reduce (“custom_all_reduce.cuh:594”) and the shared-memory communication timeout later in the log are consistent with the same root cause: once a worker’s CUDA context dies, every subsequent collective and RPC call fails or times out. This class of failure is not specific to your model; it is known to occur in vLLM when serving large models with long context lengths and tensor parallelism, especially with CUDA graphs enabled and high GPU memory utilization settings (see the similar issues reported upstream and the vLLM troubleshooting guide).
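
Before (or alongside) changing settings, it is worth checking whether the GPU or driver logged a fault at the time of the crash, since an intermittent failure that only a restart clears can also be a hardware or driver problem. Below is a minimal diagnostic sketch using standard NVIDIA tooling; nothing here is vLLM-specific, and the exact output depends on your driver version.

# Kernel log: the NVIDIA driver reports GPU faults as "Xid" messages.
sudo dmesg -T | grep -i xid

# ECC error counters and memory state for each GPU.
nvidia-smi -q -d ECC,MEMORY

# Quick per-GPU summary: driver version, temperature, and memory in use.
nvidia-smi --query-gpu=index,name,driver_version,temperature.gpu,memory.used,memory.total --format=csv

If Xid entries appear around the time of a crash, the code narrows things down considerably: Xid 79, for example, means the GPU fell off the bus (a hardware, power, or VM passthrough problem), while Xid 31 usually indicates an illegal memory access by an application, which would point back at the software stack.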

Recommended steps (an example adjusted launch command follows this list):

  • Lower --max-model-len (e.g., try 32768 or 65536 instead of 131072).
  • Reduce --gpu-memory-utilization (e.g., 0.85 or 0.8).
  • Try adding --enforce-eager to disable CUDA graphs (this will reduce performance but can avoid some CUDA graph bugs).
  • Make sure your CUDA, PyTorch, and vLLM versions are compatible and up to date.
  • If the error persists, try running with a single GPU (--tensor-parallel-size 1) to isolate if the issue is related to multi-GPU communication.
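
As a concrete starting point, the first three suggestions can be combined into a single adjusted launch command. The sketch below mirrors your original invocation and only changes what the comments call out; treat the specific values as examples to tune rather than exact recommendations.

# Sketch only: identical to the original launch command except for
# --max-model-len (65536 instead of 131072), --gpu-memory-utilization
# (0.85 instead of 0.9), the added --enforce-eager flag, and "2>&1" so
# that messages printed to stderr also end up in the log file.
CUDA_LAUNCH_BLOCKING=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 \
  vllm serve /models/Magistral-Small-2506 \
  --served-model-name Magistral-Small-2506 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --tool-call-parser mistral --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --enforce-eager \
  --port 7999 \
  >> /var/log/vllm/vllm-Magistral-Small-2506.log 2>&1 &

If the crashes stop with --enforce-eager and the shorter context length, you can reintroduce CUDA graphs and the longer context one change at a time to find the trigger; if they persist, the single-GPU run (--tensor-parallel-size 1) is the most direct way to tell whether the custom all-reduce path is involved.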

Would you like a step-by-step breakdown of how to systematically debug this further?
