I’m running vLLM to process a large dataset for inference.
I have tried several parameters, but none of them work in my case: GPU utilization stays very low (about 10%, with almost no increase in GPU power draw).
If your GPU is Nvidia, use this Python snippet to check whether it’s available for vLLM to use:
import torch

if torch.cuda.is_available():
    print(f"GPU {torch.cuda.get_device_name(0)} looking good")
else:
    print("GPU is not available. Will use CPU.")
You may already know this, but since this is the first post about vLLM failing to use the GPU, the answer can live here for folks to find.
Note: your GPU being reported as available here is no guarantee that anything will work. Only the converse holds (for Nvidia specifically, with respect to the Python above): “GPU is not available” means vLLM will use your CPU exclusively. (Unless I am wrong; I am very new at this.)
After some investigation I found that it performs poorly when the batch size is too large. If I reduce it to a reasonable size, utilization improves, but still only to about 30%.
Oh yeah, this is a fairly well-known issue. I’d suggest either of the following:
Batching yourself. This is a straightforward way to make it work, but you’ll still see under-utilization between batches, because each batch has to finish 100% before the next one starts processing.
outputs = []
for batch in dataset:
    outputs += llm.chat(batch, sampling_params)
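To make the loop above concrete, here is a minimal sketch of how you might split the dataset into fixed-size batches yourself. The `chunked` helper is hypothetical (not part of vLLM), and the `llm`, `dataset`, and `sampling_params` names are assumed from the snippet above:

    def chunked(items, batch_size):
        """Yield successive fixed-size batches from a list of requests."""
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]

    # Hypothetical usage with a vLLM LLM instance:
    # outputs = []
    # for batch in chunked(dataset, 256):
    #     outputs += llm.chat(batch, sampling_params)

The batch size (256 here) is just a placeholder; you’d tune it against your GPU memory and observed utilization.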
Thank you for your suggestions; they seem reasonable. One additional question: it seems I can pass a very large batch with the first method (far exceeding GPU memory). How will vLLM process this? Will it also maintain a queue and use continuous batching?
Yes, vLLM internally keeps a queue of all requests and uses continuous batching to batch them, but it seems the scalability isn’t that good…
Yeah, so maybe an easy solution is to use vllm serve, and then have another process use asyncio to send requests to that service? (Somewhat like your second solution.)
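A sketch of what that client process could look like: fire requests concurrently with asyncio and a semaphore, so the server’s continuous batching stays saturated without an unbounded backlog. The `fake_call` stub stands in for a real HTTP call to the vllm serve endpoint (e.g. via an OpenAI-compatible client); everything here is an assumption, not vLLM’s own API:

    import asyncio

    async def send_request(client_call, prompt, sem):
        # Cap in-flight requests so we keep the server busy
        # without queueing everything at once.
        async with sem:
            return await client_call(prompt)

    async def run_all(client_call, prompts, max_in_flight=64):
        sem = asyncio.Semaphore(max_in_flight)
        tasks = [send_request(client_call, p, sem) for p in prompts]
        # gather preserves input order, so outputs line up with prompts
        return await asyncio.gather(*tasks)

    # Stand-in for a real call to the vllm serve endpoint:
    async def fake_call(prompt):
        await asyncio.sleep(0)  # pretend network latency
        return prompt.upper()

    results = asyncio.run(run_all(fake_call, ["a", "b", "c"]))

In a real client you’d replace `fake_call` with an awaited HTTP request to the server, and tune `max_in_flight` against the server’s throughput.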
Right - your question sounded like you know more about vLLM than I do (as does everyone who knows anything about vLLM). I wrote my answer for search engines crawling for vllm+use+GPU.
Adding to my answer for other n00bs to come: the cuda_12.8.x_570.x.x.run runfile installer treats unticked checkboxes as “UNinstall” rather than “Do not install”, despite the label. If you re-run the CUDA install runfile (say, to repair a CUDA toolkit you somehow messed up) and you already have the Nvidia GPU driver installed, you cannot use the runfile to update/reinstall CUDA toolkit components unless you also drop to runlevel 3 and reinstall the Nvidia driver. Otherwise you end up with your CUDA toolkit reinstalled, but your GPU driver uninstalled!
Maybe I am just old and an unticked “install” box changed meaning from “don’t install” to “uninstall if it’s already present” while everybody else knew, but for any other old folks like me: “Don’t install” seems to mean “Uninstall” to Nvidia (at least), so watch out.