I've run into a strange phenomenon. All 100 prompts finish inference successfully, but ranks 1, 2, and 3 report an OOM error when inference ends. I don't quite understand why the OOM error appears after inference has completed rather than during inference.
An out-of-memory (OOM) error that shows up after inference rather than during it usually comes down to how memory is managed and released at the end of the run. According to vLLM's troubleshooting guide, vLLM pre-allocates most of the available GPU memory for KV cache blocks (the fraction is controlled by gpu_memory_utilization), which can lead to OOM errors if that memory is not released properly or if there is a memory leak.
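For illustration only, here is a minimal cleanup sketch that explicitly drops the engine and returns its pre-allocated GPU memory once generation has finished. The model name, the prompt, and the torch-based teardown calls are my assumptions about a typical setup, not something taken from the troubleshooting guide, and the exact teardown you need will depend on how your distributed run is launched:

```python
import gc

import torch
import torch.distributed as dist
from vllm import LLM

llm = LLM(model="facebook/opt-125m")           # placeholder model
outputs = llm.generate(["Hello, my name is"])  # inference itself succeeds

# Explicitly release the engine and the GPU memory it pre-allocated,
# so nothing lingers after the last prompt has been generated.
del llm
gc.collect()                  # force Python to free the engine objects now
torch.cuda.empty_cache()      # return cached allocator blocks to the driver
if dist.is_initialized():     # tear down the process group used by the ranks
    dist.destroy_process_group()
```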
Additionally, the vLLM GitHub issues suggest that lowering gpu_memory_utilization helps keep memory usage in check; try reducing it and see whether the OOM error goes away (a sketch follows below). Also make sure that any other processes using the GPU are kept to a minimum so that more memory stays free.
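As a rough sketch of that adjustment: the model name, the 0.7 value, and tensor_parallel_size=4 below are placeholders (the last one is only a guess based on the ranks appearing in your error), so substitute your own settings:

```python
from vllm import LLM, SamplingParams

# Reserve a smaller fraction of GPU memory for the KV cache (default is 0.9),
# leaving headroom for everything else the process allocates.
llm = LLM(
    model="facebook/opt-125m",   # placeholder model
    gpu_memory_utilization=0.7,  # lowered from the 0.9 default
    tensor_parallel_size=4,      # assumed from the ranks in the error message
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```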
Would you like more detailed guidance on adjusting these parameters or further troubleshooting steps?