When I tried to run vLLM with tensor parallelism, it failed with the error below:
~/vllm elora-dev !3 ?10 vllm root@01e648050431 07:44:25
❯ ray stop
Stopped all 5 Ray processes.
~/vllm elora-dev !3 ?10 4s vllm root@01e648050431 07:44:31
❯ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 172.17.0.3
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='172.17.0.3:6379'
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
~/vllm elora-dev !3 ?10 5s vllm root@01e648050431 07:44:40
❯ ray status
======== Autoscaler status: 2025-05-21 07:44:43.159157 ========
Node status
---------------------------------------------------------------
Active:
1 node_91ecea0e11ad34416378fdcad625e35f0a671ee556d872f4b208229d
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/48.0 CPU
0.0/4.0 GPU
0B/77.66GiB memory
0B/37.28GiB object_store_memory
Demands:
(no resource demands)
~/vllm elora-dev !3 ?10 vllm root@01e648050431 07:44:45
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
INFO 05-21 07:44:59 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 07:44:59 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 07:45:01,217 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 07:45:01,233 INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 07:45:03 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
(pid=133879) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(pid=133879) warnings.warn(
(pid=133913) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(pid=133913) warnings.warn(
2025-05-21 07:45:14,499 ERROR worker.py:422 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerVllm.init_worker() (pid=133913, ip=172.17.0.3, actor_id=58c2b835b03451e98f02452701000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fd818a8d670>)
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 180, in _check_capability
capability = get_device_capability(d)
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 435, in get_device_capability
prop = get_device_properties(device)
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 453, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
ray::RayWorkerVllm.init_worker() (pid=133913, ip=172.17.0.3, actor_id=58c2b835b03451e98f02452701000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fd818a8d670>)
File "/root/vllm/vllm/engine/ray_utils.py", line 29, in init_worker
self.worker = worker_init_fn()
File "/root/vllm/vllm/executor/ray_gpu_executor.py", line 162, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
File "/root/vllm/vllm/worker/worker.py", line 63, in __init__
self.model_runner = ModelRunner(
File "/root/vllm/vllm/worker/model_runner.py", line 90, in __init__
self.attn_backend = get_attn_backend(
File "/root/vllm/vllm/attention/selector.py", line 15, in get_attn_backend
if _can_use_flash_attn(dtype):
File "/root/vllm/vllm/attention/selector.py", line 32, in _can_use_flash_attn
if torch.cuda.get_device_capability()[0] < 8:
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 435, in get_device_capability
prop = get_device_properties(device)
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 317, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/vllm/vllm/__init__.py", line 3, in <module>
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/vllm/vllm/engine/arg_utils.py", line 6, in <module>
from vllm.config import (CacheConfig, DeviceConfig, LoRAConfig, ModelConfig,
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/vllm/vllm/config.py", line 7, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/__init__.py", line 1332, in <module>
_C._initExtension(manager_path())
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in <module>
_lazy_call(_check_capability)
File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 241, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
INFO 05-21 07:45:14 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] File "/root/vllm/vllm/engine/ray_utils.py", line 36, in execute_method
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] executor = getattr(self, method)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] File "/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] return method(self, *_args, **_kwargs)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] File "/root/vllm/vllm/engine/ray_utils.py", line 32, in __getattr__
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] return getattr(self.worker, name)
(RayWorkerVllm pid=133913) ERROR 05-21 07:45:14 ray_utils.py:44] AttributeError: 'NoneType' object has no attribute 'init_device'
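The assert mentions device=1 with an empty num_gpus, so my (possibly wrong) guess is that the Ray worker only has one GPU visible while something still asks for device index 1. Here is a minimal standalone sketch of that situation, purely hypothetical and unrelated to my actual launch command:

```python
# Purely hypothetical illustration of my guess: with only one GPU visible,
# device index 1 simply does not exist inside the process, no matter how
# many physical GPUs the host has.
import os

# Must be set before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())            # 1 -- indices are renumbered locally
print(torch.cuda.get_device_capability(0))  # index 0 is the only valid device
# torch.cuda.get_device_capability(1) would raise here, even though
# nvidia-smi on the host lists 4 GPUs.
```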
However, the host itself reports 4 GPUs:
❯ python
Python 3.9.21 | packaged by conda-forge | (main, Dec 5 2024, 13:51:40)
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
>>> print(f"PyTorch version: {torch.__version__}")
PyTorch version: 2.1.2+cu121
>>> print(f"CUDA available: {torch.cuda.is_available()}")
CUDA available: True
>>> if torch.cuda.is_available():
... print(f"CUDA version: {torch.version.cuda}")
... print(f"Number of GPUs: {torch.cuda.device_count()}")
... for i in range(torch.cuda.device_count()):
... print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
... print(f" Capability: {torch.cuda.get_device_capability(i)}")
... else:
... print("CUDA not available. Please check your installation.")
...
CUDA version: 12.1
Number of GPUs: 4
GPU 0: NVIDIA A40
Capability: (8, 6)
GPU 1: NVIDIA A40
Capability: (8, 6)
GPU 2: NVIDIA A40
Capability: (8, 6)
GPU 3: NVIDIA A40
Capability: (8, 6)
The same is visible with nvidia-smi:
nvidia-smi
Wed May 21 07:42:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:56:00.0 Off | 0 |
| 0% 43C P0 101W / 300W | 12271MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:57:00.0 Off | 0 |
| 0% 30C P8 22W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A40 Off | 00000000:D1:00.0 Off | 0 |
| 0% 31C P8 23W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A40 Off | 00000000:D6:00.0 Off | 0 |
| 0% 33C P8 32W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
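If it helps narrow this down, here is a rough probe I was considering (just a sketch; the class name and loop are my own, not vLLM code) to see what a Ray GPU actor actually reports, since vLLM's workers run inside such actors:

```python
# Rough diagnostic sketch (not part of the failing run): ask Ray GPU actors
# what they can actually see.
import os
import ray
import torch


@ray.remote(num_gpus=1)
class GpuProbe:
    def report(self):
        # Ray sets CUDA_VISIBLE_DEVICES per actor, so device_count() here
        # may be 1 even though the host has 4 GPUs.
        return {
            "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
            "device_count": torch.cuda.device_count(),
        }


ray.init(address="auto")  # connect to the cluster started with `ray start --head`
probes = [GpuProbe.remote() for _ in range(2)]
print(ray.get([p.report.remote() for p in probes]))
```

My expectation is that each actor would report a single visible device, but I am not sure whether that alone explains the device=1 assert in the traceback above.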
I would appreciate any advice.