Hi, I’m using the verl framework to train Qwen3-30B-A3B. The training job ran fine for more than 200 steps, but at step 223 it hit the ‘Memory usage increased after sleeping’ error. It’s a bit weird and I don’t know how to fix it. Can anybody give me some clues to help me fix this problem? Thanks!
My environment is as follows:
hardware: Huawei Ascend NPU 910B, 8 nodes × 8 NPUs each, 64 NPUs in total
CANN: 8.2.RC1
Python: 3.10.18
vllm: 0.9.1
vllm-ascend: 0.9.1rc3
torch: 2.5.1
torch-npu: 2.5.1.post1
Here is the error log:
  File "/cache/verl_algo/verl/single_controller/ray/base.py", line 766, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/cache/verl_algo/verl/single_controller/base/decorator.py", line 430, in inner
    return func(*args, **kwargs)
  File "/cache/verl_algo/verl/utils/profiler/mstx_profile.py", line 210, in wrapper
    return func(self, *args, **kwargs)
  File "/cache/verl_algo/verl/workers/fsdp_workers.py", line 762, in generate_sequences
    with self.rollout_sharding_manager:
  File "/cache/verl_algo/verl/utils/profiler/performance.py", line 105, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/cache/verl_algo/verl/utils/profiler/performance.py", line 118, in log
    output = func(*args, **kwargs)
  File "/cache/verl_algo/verl/workers/sharding_manager/fsdp_vllm.py", line 240, in __exit__
    self.inference_engine.sleep(level=1)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1322, in sleep
    self.llm_engine.sleep(level=level)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1860, in sleep
    self.model_executor.sleep(level=level)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 207, in sleep
    self.collective_rpc("sleep", kwargs=dict(level=level))
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm_ascend/worker/worker.py", line 202, in sleep
    assert freed_bytes >= 0, "Memory usage increased after sleeping."
AssertionError: Memory usage increased after sleeping.
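Judging only from the last frame, the failing check seems to compare the free device memory measured before and after the sleep call. Below is a minimal sketch of that logic as I understand it. The function and parameter names are hypothetical and the way `freed_bytes` is computed is my assumption; this is not the actual vllm-ascend code.

```python
def check_sleep_freed_memory(free_before: int, free_after: int) -> int:
    """Hypothetical stand-in for the check at the end of the worker's sleep().

    free_before / free_after: free device memory in bytes, measured just
    before and just after the engine is put to sleep (assumed semantics).
    """
    # Assumption: freed_bytes is the growth in free memory across the sleep.
    freed_bytes = free_after - free_before
    # Same assertion message as in the traceback.
    assert freed_bytes >= 0, "Memory usage increased after sleeping."
    return freed_bytes


if __name__ == "__main__":
    # Normal case: sleeping released memory, so free memory went up.
    print(check_sleep_freed_memory(free_before=10 * 2**30, free_after=30 * 2**30))

    # Failing case: free memory went down between the two measurements.
    try:
        check_sleep_freed_memory(free_before=30 * 2**30, free_after=10 * 2**30)
    except AssertionError as err:
        print(f"AssertionError: {err}")
```

If the two readings are device-wide, then anything else allocating on the same NPU between them could make the difference negative, so the failure might not be a leak inside sleep itself. I have not confirmed this.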