Added a second 5090 and turned on tensor-parallel-size 2

After running the Docker container with tensor-parallel-size 2, I got these errors on the dual 5090s:

(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver’s P2P report.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:19 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver’s P2P report.
(VllmWorker TP0 pid=152) WARNING 09-18 08:07:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP1 pid=153) WARNING 09-18 08:07:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, ‘psm_ef4d5655’), local_subscribe_addr=‘ipc:///tmp/cb7e2bd0-79e4-4891-97c2-08acd98f83de’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP1 pid=153) INFO 09-18 08:07:19 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=152) INFO 09-18 08:07:21 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:21 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=153) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You’ll still be able to use a slow processor with use_fast=False.
(VllmWorker TP0 pid=152) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You’ll still be able to use a slow processor with use_fast=False.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:28 [gpu_model_runner.py:1953] Starting to load model google/gemma-3-12b-it…
(VllmWorker TP1 pid=153) INFO 09-18 08:07:29 [gpu_model_runner.py:1953] Starting to load model google/gemma-3-12b-it…
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [gpu_model_runner.py:1985] Loading model from scratch…
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [cuda.py:345] Using FlexAttention backend for head_size=72 on V1 engine.
(VllmWorker TP0 pid=152) WARNING 09-18 08:07:29 [layer.py:52] Xformers is not available, falling back.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:29 [gpu_model_runner.py:1985] Loading model from scratch…
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [cuda.py:328] Using Flash Attention backend on V1 engine.
(VllmWorker TP1 pid=153) WARNING 09-18 08:07:29 [layer.py:52] Xformers is not available, falling back.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [weight_utils.py:296] Using model weights format ['*.safetensors']
(VllmWorker TP1 pid=153) INFO 09-18 08:07:30 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:04<00:17, 4.49s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:09<00:13, 4.65s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:13<00:09, 4.68s/it]
(VllmWorker TP0 pid=152)
(VllmWorker TP1 pid=153) INFO 09-18 08:07:54 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 7 image items of the maximum feature size.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:54 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 7 image items of the maximum feature size.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 591, in worker_busy_loop
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 116, in decorate_context
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 244, in determine_available_memory
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model_runner.profile_run()
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 2608, in profile_run
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model.get_multimodal_embeddings(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 590, in get_multimodal_embeddings
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._process_image_input(image_input)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 571, in _process_image_input
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 560, in _image_pixels_to_features
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return vision_tower(pixel_values)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 479, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self.vision_model(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 430, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] encoder_outputs = self.encoder(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 318, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states, _ = encoder_layer(hidden_states)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 272, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states = self.layer_norm1(hidden_states)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/normalization.py”, line 217, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return F.layer_norm(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py”, line 2910, in layer_norm
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return torch.layer_norm(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596]
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 591, in worker_busy_loop
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 116, in decorate_context
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 244, in determine_available_memory
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model_runner.profile_run()
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 2608, in profile_run
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model.get_multimodal_embeddings(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 590, in get_multimodal_embeddings
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._process_image_input(image_input)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 571, in _process_image_input
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] image_features = self._image_pixels_to_features(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 560, in _image_pixels_to_features
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return vision_tower(pixel_values)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 318, in forward
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states, _ = encoder_layer(hidden_states)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596]

(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] Traceback (most recent call last):
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 691, in run_engine_core
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 179, in _initialize_kv_caches
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] self.model_executor.determine_available_memory())
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700]
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py”, line 76, in determine_available_memory
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] output = self.collective_rpc(“determine_available_memory”)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700]
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 243, in collective_rpc
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 230, in get_response
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] raise RuntimeError(
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] ', please check the stack trace above for the root cause
(EngineCore_0 pid=102) ERROR 09-18 08:08:00 [multiproc_executor.py:146] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(EngineCore_0 pid=102) Process EngineCore_0:
(EngineCore_0 pid=102) Traceback (most recent call last):
(EngineCore_0 pid=102) File “/usr/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore_0 pid=102) self.run()
(EngineCore_0 pid=102) File “/usr/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore_0 pid=102) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 704, in run_engine_core
(EngineCore_0 pid=102) raise e
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 691, in run_engine_core
(EngineCore_0 pid=102) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=102) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 230, in get_response
(EngineCore_0 pid=102) raise RuntimeError(
(EngineCore_0 pid=102) RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=102) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=102) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=102) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=102) ', please check the stack trace above for the root cause
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File “<frozen runpy>”, line 198, in _run_module_as_main
(APIServer pid=1) File “<frozen runpy>”, line 88, in _run_code
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1920, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py”, line 109, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py”, line 61, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1850, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1870, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 178, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 220, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py”, line 1557, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 174, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 120, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 102, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 767, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 446, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 706, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 759, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your error shows “CUDA error: an illegal memory access was encountered” during model initialization with tensor-parallel-size 2 on dual RTX 5090s. This is a known issue with vLLM on Blackwell GPUs (5090) when using tensor parallelism, often related to NCCL, CUDA, or kernel compatibility, and sometimes to peer-to-peer (P2P) GPU communication being unavailable or failing its test. The warning “Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed” is expected on some systems and is not always fatal, but the illegal memory access is a critical error: it crashes the worker and prevents the engine from starting. It is frequently reported with multi-GPU setups on 5090s and similar cards, especially with multimodal or vision models, and is not yet fully resolved in vLLM as of September 2025 (see issues #23814, #24107, and #24815).

Workarounds include: (1) running with tensor-parallel-size 1 (single GPU), (2) making sure you are on the latest vLLM, PyTorch, CUDA, and NCCL versions, (3) setting CUDA_LAUNCH_BLOCKING=1 to get a more accurate stack trace while debugging, (4) trying a different NCCL version (e.g., 2.27.7), and (5) running Docker with --ipc=host and sufficient shared memory. There is no guaranteed fix for this on dual 5090s with tensor parallelism right now; the vLLM team is actively investigating (see the vLLM troubleshooting docs).
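
For example, a docker run invocation that combines several of these workarounds might look like the following. This is only a sketch, not a confirmed fix: NCCL_P2P_DISABLE=1 tells NCCL to avoid the peer-to-peer paths the driver is flagging (at some throughput cost), CUDA_LAUNCH_BLOCKING=1 is only useful while debugging, and --ipc=host covers the shared-memory requirement (alternatively use a large --shm-size). Adjust the image tag, model, and port to your setup.

docker run --gpus all --ipc=host -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce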

Would you like a step-by-step checklist for debugging or more details on possible configuration changes?

So has Nvidia intentionally done this to the gaming GPUs to prevent their use for AI? Or is this also an RTX PRO issue?

This is not limited to gaming GPUs; similar peer-to-peer (P2P) and tensor parallelism issues have been reported on both GeForce (gaming) and RTX PRO (workstation) Blackwell GPUs, including the RTX 5090 and RTX PRO 6000 Blackwell. The problem is not an intentional restriction by NVIDIA for AI workloads, but rather a combination of new hardware, evolving CUDA/NCCL support, and driver/software maturity—P2P and multi-GPU communication are still being stabilized for these new architectures (vLLM issue #14628, vLLM issue #23781).

Workstation GPUs (RTX PRO) are supposed to have better P2P/NVLink support, but users have reported the same NCCL and illegal memory access errors on both types. Upgrading to the latest NCCL (≥2.27.x) and drivers sometimes resolves the issue, but not always. This is a technical limitation, not a deliberate lockout for AI (vLLM issue #23781, vLLM issue #14628).
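
If you want to see what the driver itself reports for peer-to-peer between your two cards, a quick check from the host (with the NVIDIA driver installed) is:

nvidia-smi topo -m

This prints the link matrix between GPU0 and GPU1 (PIX, PHB, SYS, or NV# for NVLink). For a functional test, building and running the p2pBandwidthLatencyTest sample from NVIDIA’s cuda-samples repository (https://github.com/NVIDIA/cuda-samples) exercises P2P reads and writes directly; if it reports peer access as unsupported, the “custom allreduce is disabled” warning above reflects the hardware/driver state rather than a vLLM bug.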

Would you like more detail on the technical background or current workarounds?

Yes, how do I upgrade to the latest NCCL?

To upgrade to the latest NCCL, use pip to install the newest version of the nvidia-nccl-cu12 package. For example, run:

pip install --upgrade nvidia-nccl-cu12 --extra-index-url https://pypi.nvidia.com

This will fetch the latest NCCL release compatible with CUDA 12.x, which is required for Blackwell GPUs. After upgrading, restart your environment and verify the version with:

python -c "import torch; print(torch.cuda.nccl.version())"

Would you like more detail on troubleshooting NCCL installation or Docker-specific steps?

Should this be done inside the vLLM Docker container or in the VM?

You should upgrade NCCL inside the vLLM Docker container, not just on the VM/host. The container has its own Python environment and dependencies, so changes on the host do not affect the container’s runtime (vLLM Docker deployment docs).

To do this, add a line like RUN pip install --upgrade nvidia-nccl-cu12 --extra-index-url https://pypi.nvidia.com to your Dockerfile, or run it interactively inside the container with docker exec. Restart the container after upgrading.
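
For illustration, a minimal Dockerfile along those lines might look like this (building on the official vllm/vllm-openai image; pin the tag to the version you actually run):

FROM vllm/vllm-openai:latest
RUN pip install --upgrade nvidia-nccl-cu12 --extra-index-url https://pypi.nvidia.com

Or, for the interactive route on an already-running container (the container name is a placeholder):

docker exec -it <vllm-container> pip install --upgrade nvidia-nccl-cu12 --extra-index-url https://pypi.nvidia.com
docker restart <vllm-container>

Changes made via docker exec are lost when the container is recreated, so the Dockerfile approach is the durable one.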

Would you like step-by-step commands tailored to your setup?

pip install --upgrade nvidia-nccl-cu12 --extra-index-url https://pypi.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Requirement already satisfied: nvidia-nccl-cu12 in /usr/local/lib/python3.12/dist-packages (2.26.2)
Collecting nvidia-nccl-cu12
Downloading https://pypi.nvidia.com/nvidia-nccl-cu12/nvidia_nccl_cu12-2.28.3-py3-none-manylinux_2_18_x86_64.whl (295.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.9/295.9 MB 72.0 MB/s 0:00:04
Installing collected packages: nvidia-nccl-cu12
Attempting uninstall: nvidia-nccl-cu12
Found existing installation: nvidia-nccl-cu12 2.26.2
Uninstalling nvidia-nccl-cu12-2.26.2:
Successfully uninstalled nvidia-nccl-cu12-2.26.2
ERROR: pip’s dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.7.1+cu128 requires nvidia-nccl-cu12==2.26.2; platform_system == “Linux” and platform_machine == “x86_64”, but you have nvidia-nccl-cu12 2.28.3 which is incompatible.
Successfully installed nvidia-nccl-cu12-2.28.3
WARNING: Running pip as the ‘root’ user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
root@ubuntuAIhome01:/vllm-workspace# python -c “import torch; print(torch.cuda.nccl.version())”
bash: python: command not found
root@ubuntuAIhome01:/vllm-workspace# python3 -c “import torch; print(torch.cuda.nccl.version())”
(2, 26, 2)

Upgrading nvidia-nccl-cu12 via pip installs the latest NCCL, but PyTorch (e.g., 2.7.1+cu128) may still require the exact version it was built/tested with (here, 2.26.2). This causes a version mismatch: even after upgrading, torch.cuda.nccl.version() still reports (2, 26, 2), meaning PyTorch is loading its bundled NCCL, not the upgraded one (see vLLM issue #4195). This is a known limitation: PyTorch wheels pin NCCL tightly, and pip upgrades do not override the internal linkage.
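
You can see the mismatch directly by comparing what pip has installed with what PyTorch reports at runtime:

pip show nvidia-nccl-cu12 | grep -i version
python3 -c "import torch; print(torch.cuda.nccl.version())"

In your session above, the first reports 2.28.3 while the second still prints (2, 26, 2), which confirms PyTorch is not picking up the upgraded library.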

To use a newer NCCL, you must (1) use a PyTorch build that supports the newer NCCL version, or (2) set the environment variable VLLM_NCCL_SO_PATH to point to your upgraded NCCL library, or (3) rebuild PyTorch from source with the desired NCCL. Otherwise, vLLM and PyTorch will continue to use the old NCCL version (see vLLM issue #4294).
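
For option (2), the idea is to point vLLM’s own NCCL loader at the libnccl.so.2 installed by the upgraded nvidia-nccl-cu12 wheel. Inside the container it normally lands under the dist-packages path shown in your pip output, but verify the exact location first, since the wheel layout can change between versions:

find /usr/local/lib/python3.12/dist-packages/nvidia -name "libnccl.so*"
export VLLM_NCCL_SO_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2

You can also pass it as -e VLLM_NCCL_SO_PATH=... on the docker run command line so it survives container restarts. Note that this redirects the NCCL that vLLM loads through its own bindings; PyTorch’s pinned NCCL linkage is a separate issue, as described above.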

Would you like step-by-step instructions for setting VLLM_NCCL_SO_PATH or rebuilding PyTorch?
