I tested the latest VeRL, cloned from GitHub, on an AMD cluster with the following commands:
For data generation: python ~/verl/examples/data_preprocess/gsm8k.py
For LLM training: bash ~/verl/examples/grpo_trainer/run_deepseek7b_llm.sh
I still get the error "No HIP GPUs are available". However, when I test vllm and torch from an interactive Python session, they work without any problems, which is strange. Below is the test log:
aiscuser@node-0:/scratch/azureml/cr/j/33a5aa2996f244f3ada3ec3029cdb09b/exe/wd$ rocm-smi
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 2 0x74b5, 65402 37.0°C 150.0W NPS1, N/A, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
1 3 0x74b5, 27175 37.0°C 151.0W NPS1, N/A, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0%
2 4 0x74b5, 16561 36.0°C 153.0W NPS1, N/A, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
3 5 0x74b5, 54764 35.0°C 148.0W NPS1, N/A, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
4 6 0x74b5, 10760 36.0°C 147.0W NPS1, N/A, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0%
5 7 0x74b5, 48981 36.0°C 146.0W NPS1, N/A, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
6 8 0x74b5, 32548 37.0°C 152.0W NPS1, N/A, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0%
7 9 0x74b5, 60025 38.0°C 150.0W NPS1, N/A, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
aiscuser@node-0:/scratch/azureml/cr/j/33a5aa2996f244f3ada3ec3029cdb09b/exe/wd$ cd ~
aiscuser@node-0:~$ ls
azureml_job_env.sh hostfile hostfile.mpich samples tmp.7kLs tmp.lJNL tmp.Q9IC tmp.rHRS
aiscuser@node-0:~$ git clone https://github.com/volcengine/verl.git
Cloning into 'verl'...
remote: Enumerating objects: 4870, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 4870 (delta 1), reused 9 (delta 1), pack-reused 4858 (from 1)
Receiving objects: 100% (4870/4870), 3.05 MiB | 23.48 MiB/s, done.
Resolving deltas: 100% (3216/3216), done.
aiscuser@node-0:~$ ls
azureml_job_env.sh hostfile hostfile.mpich samples tmp.7kLs tmp.lJNL tmp.Q9IC tmp.rHRS verl
aiscuser@node-0:~$ cd verl
aiscuser@node-0:~/verl$ ls
docker docs examples LICENSE Notice.txt patches pyproject.toml README.md recipe requirements.txt scripts setup.py tests verl
aiscuser@node-0:~/verl$ source activate
(base) aiscuser@node-0:~/verl$ python ./examples/data_preprocess/gsm8k.py
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.94k/7.94k [00:00<00:00, 66.5MB/s]
train-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.31M/2.31M [00:00<00:00, 100MB/s]
test-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419k/419k [00:00<00:00, 288MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7473/7473 [00:00<00:00, 463374.39 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 436403.48 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7473/7473 [00:00<00:00, 33317.85 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 34151.61 examples/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 305.33ba/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 413.03ba/s]
(base) aiscuser@node-0:~/verl$ bash ./examples/grpo_trainer/run_deepseek7b_llm.sh
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/home/aiscuser/data/gsm8k/train.parquet data.val_files=/home/aiscuser/data/gsm8k/test.parquet data.train_batch_size=1024 data.max_prompt_length=512 data.max_response_length=1024 data.filter_overlong_prompts=True data.truncation=error actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=256 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=80 actor_rollout_ref.actor.use_kl_loss=True actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=160 actor_rollout_ref.rollout.tensor_model_parallel_size=2 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.n=5 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160 actor_rollout_ref.ref.fsdp_config.param_offload=True algorithm.kl_ctrl.kl_coef=0.001 trainer.critic_warmup=0 'trainer.logger=[console]' trainer.project_name=verl_grpo_example_gsm8k trainer.experiment_name=deepseek_llm_7b_function_rm trainer.n_gpus_per_node=8 trainer.nnodes=1 trainer.save_freq=-1 trainer.test_freq=5 trainer.total_epochs=15
2025-03-26 02:55:02,518 INFO worker.py:1843 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(TaskRunner pid=23702) {'actor_rollout_ref': {'actor': {'checkpoint': {'contents': ['model',
(TaskRunner pid=23702) 'hf_model',
(TaskRunner pid=23702) 'optimizer',
(TaskRunner pid=23702) 'extra']},
(TaskRunner pid=23702) 'clip_ratio': 0.2,
(TaskRunner pid=23702) 'entropy_coeff': 0.001,
(TaskRunner pid=23702) 'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=23702) 'optimizer_offload': False,
(TaskRunner pid=23702) 'param_offload': False,
(TaskRunner pid=23702) 'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=23702) 'grad_clip': 1.0,
(TaskRunner pid=23702) 'kl_loss_coef': 0.001,
(TaskRunner pid=23702) 'kl_loss_type': 'low_var_kl',
(TaskRunner pid=23702) 'optim': {'lr': 1e-06,
(TaskRunner pid=23702) 'lr_warmup_steps': -1,
(TaskRunner pid=23702) 'lr_warmup_steps_ratio': 0.0,
...
(TaskRunner pid=23702) 'project_name': 'verl_grpo_example_gsm8k',
(TaskRunner pid=23702) 'remove_previous_ckpt_in_save': False,
(TaskRunner pid=23702) 'resume_from_path': False,
(TaskRunner pid=23702) 'resume_mode': 'auto',
(TaskRunner pid=23702) 'save_freq': -1,
(TaskRunner pid=23702) 'test_freq': 5,
(TaskRunner pid=23702) 'total_epochs': 15,
(TaskRunner pid=23702) 'total_training_steps': None,
(TaskRunner pid=23702) 'val_generations_to_log_to_wandb': 0}}
(TaskRunner pid=23702) WARNING 03-26 02:55:11 rocm.py:13] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/home/aiscuser/data/gsm8k/train.parquet', 'data.val_files=/home/aiscuser/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=80', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=160', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=deepseek_llm_7b_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/home/aiscuser/verl/verl/trainer/main_ppo.py", line 54, in main
run_ppo(config)
File "/home/aiscuser/verl/verl/trainer/main_ppo.py", line 72, in run_ppo
ray.get(runner.run.remote(config))
File "/opt/conda/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 2782, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 929, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=23702, ip=100.65.3.152, actor_id=ef46086bf9037197bc3baabc01000000, repr=<main_ppo.TaskRunner object at 0x7f104f3f0f90>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aiscuser/verl/verl/trainer/main_ppo.py", line 97, in run
from verl.workers.fsdp_workers import ActorRolloutRefWorker, CriticWorker
File "/home/aiscuser/verl/verl/workers/fsdp_workers.py", line 41, in <module>
from verl.workers.sharding_manager.fsdp_ulysses import FSDPUlyssesShardingManager
File "/home/aiscuser/verl/verl/workers/sharding_manager/__init__.py", line 34, in <module>
if is_vllm_available():
^^^^^^^^^^^^^^^^^^^
File "/home/aiscuser/verl/verl/utils/import_utils.py", line 35, in is_vllm_available
import vllm
File "/vllm/vllm/__init__.py", line 3, in <module>
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
File "/vllm/vllm/engine/arg_utils.py", line 11, in <module>
from vllm.config import (CacheConfig, ConfigFormat, DecodingConfig,
File "/vllm/vllm/config.py", line 12, in <module>
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
File "/vllm/vllm/model_executor/layers/quantization/__init__.py", line 10, in <module>
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import ( # noqa: E501
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 11, in <module>
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe import ( # noqa: E501
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 10, in <module>
from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py", line 4, in <module>
from .compressed_tensors_w8a8_fp8 import CompressedTensorsW8A8Fp8
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 10, in <module>
from vllm.model_executor.layers.quantization.utils.w8a8_utils import (
File "/vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 11, in <module>
TORCH_DEVICE_IDENTITY = torch.ones(1).cuda() if is_hip() else None
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: No HIP GPUs are available
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(base) aiscuser@node-0:~/verl$ pip list
DEPRECATION: Loading egg at /opt/conda/lib/python3.11/site-packages/setuptools-78.0.2-py3.11.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Package Version Editable project location
--------------------------------- -------------------------- -------------------------
accelerate 1.5.2
aiohappyeyeballs 2.6.1
aiohttp 3.11.14
aiohttp-cors 0.8.0
aiosignal 1.3.2
amdsmi 25.1.0+8dc45db
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.9.0
archspec 0.2.5
attrs 25.3.0
autocommand 2.2.2
awscli 1.38.19
backports.tarfile 1.2.0
boltons 24.0.0
boto3 1.37.19
botocore 1.37.19
Brotli 1.1.0
cachetools 5.5.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
cloudpickle 3.1.1
codetiming 1.4.0
colorama 0.4.6
colorful 0.5.6
conda 25.1.1
conda-libmamba-solver 25.3.0
conda-package-handling 2.4.0
conda_package_streaming 0.11.0
datasets 3.4.1
dill 0.3.8
diskcache 5.6.3
distlib 0.3.9
distro 1.9.0
docker-pycreds 0.4.0
docutils 0.16
einops 0.8.1
fastapi 0.115.12
filelock 3.16.1
flash_attn 2.7.3
frozendict 2.4.6
frozenlist 1.5.0
fsspec 2024.10.0
gguf 0.10.0
gitdb 4.0.12
GitPython 3.1.44
google-api-core 2.24.2
google-auth 2.38.0
googleapis-common-protos 1.69.2
grpcio 1.71.0
h11 0.14.0
h2 4.2.0
hiredis 3.1.0
hpack 4.1.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.29.3
hydra-core 1.3.2
hyperframe 6.1.0
idna 3.10
importlib_metadata 8.6.1
inflect 7.3.1
iniconfig 2.1.0
inquirerpy 0.3.4
interegular 0.3.3
jaraco.collections 5.1.0
jaraco.context 5.3.0
jaraco.functools 4.0.1
jaraco.text 3.12.1
Jinja2 3.1.4
jiter 0.9.0
jmespath 1.0.1
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
libmambapy 2.0.8
libnacl 2.1.0
liger_kernel 0.5.5
llvmlite 0.44.0
lm-format-enforcer 0.10.6
MarkupSafe 2.1.5
menuinst 2.2.0
mistral_common 1.5.4
more-itertools 10.3.0
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.19.0
multidict 6.2.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.2
numba 0.61.0
numpy 1.26.4
omegaconf 2.3.0
openai 1.68.2
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.11.0.86
orjson 3.10.16
outlines 0.0.46
packaging 24.2
pandas 2.2.3
partial-json-parser 0.2.1.1.post5
peft 0.15.0
pfzy 0.3.4
pillow 11.0.0
pip 25.0.1
platformdirs 4.3.6
pluggy 1.5.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.50
propcache 0.3.0
proto-plus 1.26.1
protobuf 5.29.4
psutil 7.0.0
py-cpuinfo 9.0.0
py-spy 0.4.0
pyairports 2.1.1
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybind11 2.13.6
pycosat 0.6.6
pycountry 24.6.1
pycparser 2.22
pydantic 2.10.6
pydantic_core 2.27.2
pylatexenc 2.10
PySocks 1.7.1
pytest 8.3.5
pytest-asyncio 0.26.0
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
pytorch-triton-rocm 3.3.0+git96316ce5
pytz 2025.2
PyYAML 6.0.2
pyzmq 26.3.0
ray 2.44.0
redis 5.2.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rpds-py 0.23.1
rsa 4.7.2
ruamel.yaml 0.18.10
ruamel.yaml.clib 0.2.8
s3transfer 0.11.4
safetensors 0.5.3
scipy 1.15.2
sentencepiece 0.2.0
sentry-sdk 2.24.1
setproctitle 1.3.5
setuptools 75.8.2
setuptools 78.0.2
setuptools-scm 8.2.0
six 1.17.0
smart-open 7.1.0
smmap 5.0.2
sniffio 1.3.1
starlette 0.46.1
supervisor 4.2.5
sympy 1.13.3
tensorboardX 2.6.2.2
tensordict 0.6.2
tensorizer 2.9.2
tiktoken 0.9.0
tokenizers 0.21.1
tomli 2.0.1
torch 2.8.0.dev20250325+rocm6.3
torchaudio 2.6.0.dev20250325+rocm6.3
torchdata 0.11.0
torchvision 0.22.0.dev20250325+rocm6.3
tqdm 4.67.1
transformers 4.50.1
triton 3.2.0
truststore 0.10.1
typeguard 4.3.0
typing_extensions 4.12.2
tzdata 2025.2
urllib3 2.3.0
uvicorn 0.34.0
uvloop 0.21.0
verl 0.2.0.dev0 /verl
virtualenv 20.29.3
vllm 0.6.3+rocm634 /vllm
wandb 0.19.8
watchfiles 1.0.4
wcwidth 0.2.13
websockets 15.0.1
wheel 0.45.1
wrapt 1.17.2
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0
zstandard 0.23.0
(base) aiscuser@node-0:~/verl$ python
Python 3.11.11 | packaged by conda-forge | (main, Mar 3 2025, 20:43:55) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> import vllm
WARNING 03-26 03:00:44 rocm.py:13] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
>>> torch.cuda.is_available()
True
>>> torch.ones(1).cuda()
tensor([1.], device='cuda:0')
>>>
(base) aiscuser@node-0:~/verl$
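
The log shows the same `torch.ones(1).cuda()` call succeeding in the interactive session and failing inside the Ray `TaskRunner` process. One way to narrow this down is to compare the GPU-visibility environment variables in both contexts. The following is a minimal sketch, not part of verl, vllm, or Ray: the helper names `gpu_visibility_report` and `hidden_by_env` are hypothetical, and the list of variables is an assumption about which ones matter on a ROCm stack. Whether any of them is actually the cause here is not confirmed by the log.

```python
import os

# Environment variables that commonly restrict which GPUs a process can see.
# This list is an assumption; extend it for your own stack if needed.
VISIBILITY_VARS = (
    "HIP_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)


def gpu_visibility_report(env=None):
    """Return {variable name: value, or None if unset} for each variable."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in VISIBILITY_VARS}


def hidden_by_env(report):
    """Return the names of variables that are set but empty.

    A set-but-empty visibility variable typically hides every device
    from the process, which would be one possible explanation for a
    "no GPUs available" error in an otherwise healthy environment.
    """
    return [
        name
        for name, value in report.items()
        if value is not None and value.strip() == ""
    ]


if __name__ == "__main__":
    report = gpu_visibility_report()
    for name, value in report.items():
        print(f"{name}={value!r}")
    print("set-but-empty:", hidden_by_env(report))
```

Running this once from the interactive shell and once from inside the Ray worker process (for example, by calling `gpu_visibility_report()` before the failing import) would show whether the two processes see different values. If they differ, that points at how the worker's environment is set up rather than at the torch or vllm installation.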