I am trying to deploy the local Qwen3-30B-A3B model with vLLM, configure the tensor parallel (TP) size, and profile the run with Nsight Systems (nsys). Engine initialization fails every time, however: the same error appears whether tensor_parallel_size is set to 2, 4, or 8. The command and the full output for tensor_parallel_size=4 follow.
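For context, the engine setup amounts to roughly the following. This is only a minimal sketch reconstructed from the "non-default args" line in the log, not the actual vllm_profiler.py; the helper names build_llm_kwargs and setup_model are hypothetical, and disable_log_stats is left at whatever default LLM applies.

```python
from typing import Any, Dict

# Model path as reported in the log output.
MODEL_PATH = "/home/ffc3/yfzhao/models/Qwen3-30B-A3B"


def build_llm_kwargs(tp_size: int) -> Dict[str, Any]:
    """Collect the non-default engine arguments shown in the log (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tensor_parallel_size must be a positive integer")
    return {
        "model": MODEL_PATH,
        "trust_remote_code": True,
        "tensor_parallel_size": tp_size,
        "gpu_memory_utilization": 0.8,
    }


def setup_model(tp_size: int):
    """Create the engine; vLLM is imported lazily so the helper above runs without it."""
    from vllm import LLM

    return LLM(**build_llm_kwargs(tp_size))


if __name__ == "__main__":
    # Only print the argument sets for the TP sizes that were tried;
    # nothing is launched here.
    for tp in (2, 4, 8):
        print(tp, build_llm_kwargs(tp))
```

Calling setup_model(4) corresponds to the run shown in the log; the other two sizes fail in the same way.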
nsys profile -o ./vllm_profile.qdrep python vllm_profiler.py
INFO 10-15 08:32:21 [__init__.py:216] Automatically detected platform cuda.
Loading model: /home/ffc3/yfzhao/models/Qwen3-30B-A3B
INFO 10-15 08:32:24 [utils.py:233] non-default args: {'trust_remote_code': True, 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.8, 'disable_log_stats': True, 'model': '/home/ffc3/yfzhao/models/Qwen3-30B-A3B'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 10-15 08:32:24 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-15 08:32:24 [model.py:1510] Using max model len 40960
INFO 10-15 08:32:25 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:25 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:25 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/home/ffc3/yfzhao/models/Qwen3-30B-A3B', speculative_config=None, tokenizer='/home/ffc3/yfzhao/models/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/ffc3/yfzhao/models/Qwen3-30B-A3B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=59619) WARNING 10-15 08:32:25 [multiproc_executor.py:720] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_734fb5d4'), local_subscribe_addr='ipc:///tmp/177cc19d-0c6f-49d9-ba6a-e39f33447703', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:28 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_85b837c4'), local_subscribe_addr='ipc:///tmp/519f3dbd-e3b4-4cee-aa1d-b18bc334d23e', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:28 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8a4b849e'), local_subscribe_addr='ipc:///tmp/1965489f-d6da-4dc2-a7c5-5e88dcdf3de8', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:28 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_16d2d2d0'), local_subscribe_addr='ipc:///tmp/3e17b5f5-6a74-4004-bc7a-c4b1fef6bdb2', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:28 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a2ae80d9'), local_subscribe_addr='ipc:///tmp/80809eff-23a6-4de3-8382-39a56311b3d2', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [__init__.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [__init__.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [__init__.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [__init__.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=59619) INFO 10-15 08:32:29 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708]     raise e from None
(EngineCore_DP0 pid=59619) ERROR 10-15 08:32:29 [core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=59619) Process EngineCore_DP0:
(EngineCore_DP0 pid=59619) Traceback (most recent call last):
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=59619)     self.run()
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=59619)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=59619)     raise e
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=59619)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=59619)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=59619)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=59619)     self._init_executor()
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=59619)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=59619)   File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=59619)     raise e from None
(EngineCore_DP0 pid=59619) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "/home/ffc3/shr/vllm_profiler.py", line 57, in main
    profiler.setup_model()
  File "/home/ffc3/shr/vllm_profiler.py", line 18, in setup_model
    self.llm = LLM(
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 297, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
    return cls(vllm_config=vllm_config,
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 114, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 80, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 602, in __init__
    super().__init__(
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
    with launch_core_engines(vllm_config, executor_class,
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
    wait_for_engine_startup(
  File "/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
/home/ffc3/anaconda/anaconda3/envs/vllm_shr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Error occurred: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
Generating '/tmp/nsys-report-7c78.qdstrm'
Failed to create '/home/ffc3/shr/./vllm_profile.nsys-rep': File exists.
Use --force-overwrite true to overwrite existing files.
[1/1] [========================100%] nsys-report-4b76.nsys-rep
Generated:
/tmp/nsys-report-4b76.nsys-rep
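The last few lines are a separate, smaller problem: nsys could not write ./vllm_profile.nsys-rep because a report from an earlier run already exists, so the new report was left in /tmp. The sketch below shows one way to build the same profiling command with the --force-overwrite true option that the message itself suggests. The function name build_nsys_command is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex
from typing import List


def build_nsys_command(
    script: str,
    output: str = "./vllm_profile.qdrep",
    overwrite: bool = True,
) -> List[str]:
    """Assemble the nsys invocation used above (hypothetical helper)."""
    cmd = ["nsys", "profile", "-o", output]
    if overwrite:
        # Option named in the nsys message: replace an existing report file.
        cmd += ["--force-overwrite", "true"]
    cmd += ["python", script]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be checked before launching it by hand.
    print(shlex.join(build_nsys_command("vllm_profiler.py")))
```

This only prevents the report from being stranded in /tmp; it does not affect the engine initialization failure.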