After running the vLLM Docker container with --tensor-parallel-size 2 on dual RTX 5090s, I got the errors below.
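The launch was essentially the standard vLLM OpenAI-server Docker invocation with tensor parallelism over both cards, roughly this (the image tag, port, cache mount, and HF token handling here are illustrative placeholders, not my exact flags):

```bash
# Rough reconstruction of the launch command (tag/port/mount are placeholders;
# the HF token env var needed for the gated Gemma model is omitted here).
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model google/gemma-3-12b-it \
    --tensor-parallel-size 2
```

Full startup log: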
(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:19 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP0 pid=152) WARNING 09-18 08:07:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP1 pid=153) WARNING 09-18 08:07:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_ef4d5655'), local_subscribe_addr='ipc:///tmp/cb7e2bd0-79e4-4891-97c2-08acd98f83de', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=152) INFO 09-18 08:07:19 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP1 pid=153) INFO 09-18 08:07:19 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=152) INFO 09-18 08:07:21 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:21 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=153) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker TP0 pid=152) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:28 [gpu_model_runner.py:1953] Starting to load model google/gemma-3-12b-it...
(VllmWorker TP1 pid=153) INFO 09-18 08:07:29 [gpu_model_runner.py:1953] Starting to load model google/gemma-3-12b-it...
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [gpu_model_runner.py:1985] Loading model from scratch...
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [cuda.py:345] Using FlexAttention backend for head_size=72 on V1 engine.
(VllmWorker TP0 pid=152) WARNING 09-18 08:07:29 [layer.py:52] Xformers is not available, falling back.
(VllmWorker TP1 pid=153) INFO 09-18 08:07:29 [gpu_model_runner.py:1985] Loading model from scratch...
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [cuda.py:328] Using Flash Attention backend on V1 engine.
(VllmWorker TP1 pid=153) WARNING 09-18 08:07:29 [layer.py:52] Xformers is not available, falling back.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:29 [weight_utils.py:296] Using model weights format ['*.safetensors']
(VllmWorker TP1 pid=153) INFO 09-18 08:07:30 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:04<00:17, 4.49s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:09<00:13, 4.65s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:13<00:09, 4.68s/it]
(VllmWorker TP0 pid=152)
(VllmWorker TP1 pid=153) INFO 09-18 08:07:54 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 7 image items of the maximum feature size.
(VllmWorker TP0 pid=152) INFO 09-18 08:07:54 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 7 image items of the maximum feature size.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 591, in worker_busy_loop
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 116, in decorate_context
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 244, in determine_available_memory
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model_runner.profile_run()
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 2608, in profile_run
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model.get_multimodal_embeddings(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 590, in get_multimodal_embeddings
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._process_image_input(image_input)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 571, in _process_image_input
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] image_features = self._image_pixels_to_features(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 560, in _image_pixels_to_features
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return vision_tower(pixel_values)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 479, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self.vision_model(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 430, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] encoder_outputs = self.encoder(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 318, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states, _ = encoder_layer(hidden_states)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 272, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states = self.layer_norm1(hidden_states)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/normalization.py”, line 217, in forward
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return F.layer_norm(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py”, line 2910, in layer_norm
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return torch.layer_norm(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596]
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 591, in worker_busy_loop
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 116, in decorate_context
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 244, in determine_available_memory
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model_runner.profile_run()
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 2608, in profile_run
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] self.model.get_multimodal_embeddings(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 590, in get_multimodal_embeddings
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._process_image_input(image_input)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 571, in _process_image_input
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] image_features = self._image_pixels_to_features(
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py”, line 560, in _image_pixels_to_features
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return vision_tower(pixel_values)
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=153) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1762, in _call_impl
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py”, line 318, in forward
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] hidden_states, _ = encoder_layer(hidden_states)
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596]
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP0 pid=152) ERROR 09-18 08:07:57 [multiproc_executor.py:596] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] Traceback (most recent call last):
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 691, in run_engine_core
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 179, in _initialize_kv_caches
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] self.model_executor.determine_available_memory())
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700]
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py”, line 76, in determine_available_memory
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] output = self.collective_rpc(“determine_available_memory”)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700]
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 243, in collective_rpc
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 230, in get_response
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] raise RuntimeError(
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=102) ERROR 09-18 08:07:57 [core.py:700] ', please check the stack trace above for the root cause
(EngineCore_0 pid=102) ERROR 09-18 08:08:00 [multiproc_executor.py:146] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(EngineCore_0 pid=102) Process EngineCore_0:
(EngineCore_0 pid=102) Traceback (most recent call last):
(EngineCore_0 pid=102) File “/usr/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore_0 pid=102) self.run()
(EngineCore_0 pid=102) File “/usr/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore_0 pid=102) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 704, in run_engine_core
(EngineCore_0 pid=102) raise e
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 691, in run_engine_core
(EngineCore_0 pid=102) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=102) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=102) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py”, line 230, in get_response
(EngineCore_0 pid=102) raise RuntimeError(
(EngineCore_0 pid=102) RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=102) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=102) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=102) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=102) ', please check the stack trace above for the root cause
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1920, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1850, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1870, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 178, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 220, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1557, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 174, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 120, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 102, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 767, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 446, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 706, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 759, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
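The log itself names two knobs: the custom_all_reduce warning suggests disable_custom_all_reduce=True, and the CUDA error suggests CUDA_LAUNCH_BLOCKING=1 to get an accurate stack for the illegal memory access. For reference, this is roughly how I'd pass both on a retry (same placeholders as the command above; I haven't verified that either flag avoids the crash):

```bash
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the failing op is reported accurately.
# --disable-custom-all-reduce is the vLLM engine flag the custom_all_reduce warning points at.
docker run --runtime nvidia --gpus all \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model google/gemma-3-12b-it \
    --tensor-parallel-size 2 \
    --disable-custom-all-reduce
```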