I restarted my Docker container, so that no Python code touches CUDA before vLLM starts.
The earlier bug no longer occurs, but startup now hangs at the end of the output below:
Last login: Wed May 21 16:17:10 on ttys002
░▒▓ │ ~ ······························································································· base │ 16:21:01 ▓▒░─╮
❯ ssh epcc-gpu-A40-2 ─╯
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-88-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
* Introducing Expanded Security Maintenance for Applications.
Receive updates to over 25,000 software packages with your
Ubuntu Pro subscription. Free for personal use.
https://ubuntu.com/pro
Expanded Security Maintenance for Infrastructure is not enabled.
401 updates can be applied immediately.
318 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable
Enable ESM Infra to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status
New release '22.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Your Hardware Enablement Stack (HWE) is supported until April 2025.
Web console: https://a40-02:9090/ or https://10.2.64.74:9090/
Last login: Wed May 21 16:17:13 2025 from 10.2.0.1
~ ................................................................................................. base py yixwang@a40-02 16:21:04
> docker restart yixwang-elora
yixwang-elora
~ ............................................................................................ 12s base py yixwang@a40-02 16:21:20
> docker exec -it yixwang-elora zsh
/workspace base root@01e648050431 08:21:23
❯ cd ~
~ base root@01e648050431 08:21:26
❯ conda activate vllm
~ vllm root@01e648050431 08:21:30
❯ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 172.17.0.3
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='172.17.0.3:6379'
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
~ 3s vllm root@01e648050431 08:21:41
❯ python /root/vllm/vllm/entrypoints/openai/api_server.py --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2 --lora-modules lora1=/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test
/root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
INFO 05-21 08:21:54 api_server.py:154] vLLM API server version 0.4.0
INFO 05-21 08:21:54 api_server.py:155] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='lora1', local_path='/root/vllm/elora_helper/lora_models/llama-2-7b-sql-lora-test')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Llama-2-7b-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2025-05-21 08:21:55,353 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2025-05-21 08:21:55,370 INFO worker.py:1841 -- Connected to Ray cluster.
INFO 05-21 08:21:57 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='meta-llama/Llama-2-7b-hf', tokenizer='meta-llama/Llama-2-7b-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 05-21 08:22:04 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=2475) /root/.anaconda3/envs/vllm/lib/python3.9/site-packages/_distutils_hack/__init__.py:53: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
(RayWorkerVllm pid=2475) warnings.warn(
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 selector.py:16] Using FlashAttention backend.
INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=2475) INFO 05-21 08:22:06 pynccl_utils.py:45] vLLM is using nccl==2.18.1
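
The output stops right after the `pynccl_utils.py` lines, so one way to narrow this down might be to rerun the same command with NCCL's own logging turned on and a time limit, to see what NCCL reports before the stall. Below is a minimal sketch of that idea, not a confirmed fix. `NCCL_DEBUG=INFO` is a standard NCCL environment variable. The helper name `run_with_nccl_debug` and the timeout value are hypothetical and chosen only for illustration. Whether the variable reaches Ray workers that were started earlier by `ray start --head` is an assumption; it may need to be exported before starting Ray as well.

```python
import os
import subprocess
import sys


def run_with_nccl_debug(cmd, timeout_s):
    """Run cmd with NCCL_DEBUG=INFO and a time limit.

    Returns (timed_out, output), where output is whatever the process
    printed to stdout/stderr before it exited or was killed.
    Hypothetical helper, for illustration only.
    """
    # Copy the current environment and enable NCCL's verbose logging.
    env = dict(os.environ, NCCL_DEBUG="INFO")
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
        return False, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; keep any partial output.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return True, out


# Example call (commented out so that importing this file starts nothing):
# timed_out, log = run_with_nccl_debug(
#     [sys.executable, "/root/vllm/vllm/entrypoints/openai/api_server.py",
#      "--model", "meta-llama/Llama-2-7b-hf", "--tensor-parallel-size", "2"],
#     timeout_s=300,
# )
```

If `timed_out` comes back `True`, the tail of the returned log should show the last thing NCCL printed before the hang, which may be easier to search for than the bare vLLM output.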