Deploying Qwen3-32B fails with an error, and I can't tell what the cause is.

Launch command:

CUDA_VISIBLE_DEVICES=0,1,2,5 vllm serve /mnt/afs/share_models/git_models/Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --port 6669 \
    --host 0.0.0.0 \
    --tensor-parallel-size 4 \
    --dtype half \
    --enable-chunked-prefill \
    --enable-prefix-caching

Log:

(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] WorkerProc hit an exception.
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] Traceback (most recent call last):
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 517, in worker_busy_loop
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     output = func(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     self.model_runner.profile_run()
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1899, in profile_run
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     sampler_output = self._dummy_sampler_run(hidden_states)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1760, in _dummy_sampler_run
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     logits = self.model.compute_logits(hidden_states, None)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 309, in compute_logits
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     logits = self.logits_processor(self.lm_head, hidden_states,
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     logits = self._get_logits(hidden_states, lm_head, embedding_bias)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 113, in _get_logits
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     logits = self._gather_logits(logits)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 95, in _gather_logits
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     logits = tensor_model_parallel_all_gather(logits)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 19, in tensor_model_parallel_all_gather
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return get_tp_group().all_gather(input_, dim)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 372, in all_gather
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return torch.ops.vllm.all_gather(input_,
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 138, in all_gather
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return group._all_gather_out_place(tensor, dim)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 381, in _all_gather_out_place
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return self.device_communicator.all_gather(input_, dim)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/distributed/device_communicators/base_device_communicator.py", line 129, in all_gather
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     dist.all_gather_into_tensor(output_tensor,
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3836, in all_gather_into_tensor
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]     work = group._allgather_base(output_tensor, input_tensor, opts)
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] ncclUnhandledCudaError: Call to CUDA function failed.
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] Last error:
(VllmWorker rank=3 pid=457669) ERROR 07-30 17:25:21 [multiproc_executor.py:522] Failed to CUDA calloc 2097152 bytes

Your error message shows torch.distributed.DistBackendError: NCCL error ... ncclUnhandledCudaError: Call to CUDA function failed. ... Failed to CUDA calloc ..., which is a classic NCCL-communication or CUDA memory-allocation failure. Common causes include: 1) insufficient GPU memory, 2) insufficient /dev/shm shared memory, 3) misconfigured NCCL environment variables or drivers, and 4) faulty multi-GPU communication.

Suggested troubleshooting steps:

  1. Check that each card has enough free memory. Qwen3-32B at half precision needs roughly 64 GB for the weights alone, i.e. about 16 GB per card with 4-way tensor parallelism, plus headroom for activations, KV cache, and communication buffers; at least 24 GB free per card is recommended.
  2. If running under Docker or Kubernetes, set --shm-size to at least 16 GB and mount /dev/shm; otherwise NCCL cannot allocate shared memory (see the official documentation).
  3. Check the NCCL environment variables, e.g. NCCL_CUMEM_ENABLE=0, and make sure all processes see an identical environment, to avoid NCCL version incompatibilities or conflicting settings (see the common NCCL issues notes).
  4. If the error persists, try reducing the batch size, max_model_len, or max_num_seqs, or set enforce_eager=True to disable CUDA graphs and reduce memory usage (see the memory-optimization guide).
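
As a sanity check on the per-card estimate in step 1, the weight footprint can be worked out with simple shell arithmetic (this assumes 32 billion parameters at 2 bytes each under --dtype half; activations, KV cache, and NCCL buffers come on top of this):

```shell
# Rough per-GPU weight memory for Qwen3-32B at --dtype half with TP=4.
# 32e9 params * 2 bytes = 64e9 bytes of weights, sharded across 4 GPUs.
PARAMS=32000000000
BYTES_PER_PARAM=2
TP=4
PER_GPU_GIB=$(( PARAMS * BYTES_PER_PARAM / TP / 1024 / 1024 / 1024 ))
echo "~${PER_GPU_GIB} GiB of weights per GPU"   # roughly 14 GiB (~16 GB)
```

So weights alone are nowhere near a full card; the extra headroom is for runtime allocations.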

Would you like more detailed troubleshooting steps or configuration suggestions?


My setup has 4 cards with 95 GB of memory each, so GPU memory shouldn't be the problem.

With 95 GB per card, 4-way tensor parallelism should indeed have ample memory for Qwen3-32B. Given the error, the problem more likely lies in NCCL communication or the shared-memory (/dev/shm) configuration rather than insufficient GPU memory.

Key things to check:

  1. Whether /dev/shm is large enough: set --shm-size=32g or larger, or use --ipc=host; otherwise NCCL cannot allocate its communication memory (see the official notes).
  2. Whether the selected GPU IDs form a clean topology. Your command uses 0,1,2,5; a non-contiguous set can cross NUMA nodes or PCIe switches and break NCCL communication, so prefer contiguous IDs such as 0,1,2,3 (see the related discussion).
  3. The NCCL environment variables: add export NCCL_CUMEM_ENABLE=0, and make sure the driver, CUDA, and NCCL versions are compatible.
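
Putting points 2 and 3 together, a relaunch sketch that also surfaces the exact failing CUDA call (the model path and port are copied from your original command; the contiguous GPU set and env-var choices are a suggested starting point, not a guaranteed fix):

```shell
# Inspect the interconnect between the GPUs before relaunching
nvidia-smi topo -m

# Disable NCCL's cuMem-based allocator and enable debug logging,
# then relaunch on a contiguous GPU set
export NCCL_CUMEM_ENABLE=0
export NCCL_DEBUG=INFO
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/afs/share_models/git_models/Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --port 6669 \
    --host 0.0.0.0 \
    --tensor-parallel-size 4 \
    --dtype half
```

With NCCL_DEBUG=INFO the log will name the specific CUDA call that failed, which narrows down whether it is an allocation, topology, or driver issue.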

If the problem persists, please share the output of nvidia-smi topo -m along with the full launch command.

Would you like more detailed troubleshooting methods?
