Help with an error when running vLLM with tensor parallelism

Yixiao · May 21, 2025, 8:25am

I tried restarting my Docker container, so that no Python code touches CUDA before vLLM starts.
It now gets past the earlier bug, but it hangs at the end of the output below:

Last login: Wed May 21 16:17:10 on ttys002
░▒▓  │  ~ ······························································································· base  │ 16:21:01  ▓▒░─╮
❯ ssh epcc-gpu-A40-2                                                                                                                 ─╯
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-88-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

 * Introducing Expanded Security Maintenance for Applications.
   Receive updates to over 25,000 software packages with your
   Ubuntu Pro subscription. Free for personal use.

     https://ubuntu.com/pro

Expanded Security Maintenance for Infrastructure is not enabled.

401 updates can be applied immediately.
318 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable

Enable ESM Infra to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status

New release '22.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Your Hardware Enablement Stack (HWE) is supported until April 2025.
Web console: https://a40-02:9090/ or https://10.2.64.74:9090/

Last login: Wed May 21 16:17:13 2025 from 10.2.0.1
 ~ ................................................................................................. base py  yixwang@a40-02  16:21:04
> docker restart yixwang-elora
yixwang-elora

 ~ ............................................................................................ 12s  base py  yixwang@a40-02  16:21:20
> docker exec -it yixwang-elora zsh
  /workspace                                                                                       base root@01e648050431  08:21:23
❯ cd ~

  ~                                                                                                base root@01e648050431  08:21:26
❯ conda activate vllm

  ~                                                                                                vllm root@01e648050431  08:21:30
❯ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.17.0.3

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.17.0.3:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  ~                                                                                           3s  vllm root@01e648050431  08:21:41
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
INFO 05-21 08:21:54 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 08:21:54 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 08:21:55,353	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 08:21:55,370	INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 08:21:57 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 05-21 08:22:04 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=2475) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(RayWorkerVllm pid=2475)   warnings.warn(
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 selector.py:16] Using FlashAttention backend.
INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1
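
The output stops right after the `pynccl_utils.py` version line, so the stall may be in NCCL initialization. That is an assumption; the log alone does not confirm it. One way to get more detail would be to rerun the same command with NCCL's own debug logging turned on through the `NCCL_DEBUG` environment variable. A minimal sketch that builds that command line (`with_nccl_debug` is a hypothetical helper, not part of vLLM):

```python
import shlex

# The launch command exactly as it appears in the log above.
VLLM_CMD = [
    "python",
    "/root/vllm/vllm/entrypoints/openai/api_server.py",
    "--model", "meta-llama/Llama-2-7b-hf",
    "--tensor-parallel-size", "2",
    "--lora-modules",
    "lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test",
]


def with_nccl_debug(cmd, level="INFO"):
    """Return a shell line that runs cmd with NCCL_DEBUG set.

    Hypothetical helper for illustration only. NCCL_DEBUG is read by
    NCCL itself and makes it log its setup steps.
    """
    return f"NCCL_DEBUG={shlex.quote(level)} " + shlex.join(cmd)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be pasted
    # into the container shell after `ray start --head`.
    print(with_nccl_debug(VLLM_CMD))
```

If the extra NCCL lines stop at a particular step, that step is a reasonable place to start looking.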
