Why can't vLLM fully use the GPU in batch processing?

I’m running vLLM to process a large dataset for inference.

I have tried several parameters, but they don't help in my case. GPU utilization is very low (about 10%, with nearly no increase in GPU power draw).

self.llm = LLM(
    model=self.model_name,
    trust_remote_code=True,
    # max_model_len=4096,
    # max_num_seqs=512*10,
    # max_num_batched_tokens=512*10,
    # enable_chunked_prefill=True,
    # enable_prefix_caching=True,
)
self.llm.chat(prompts, sampling_params)

prompts can be a very large list (e.g., anywhere from 1,000 to 100,000 prompts)

Thank you for your help!

Maybe you could try:
self.llm.generate(prompts, sampling_params)

If your GPU is Nvidia, use this Python snippet to check whether it's available for vLLM to use.

import torch
if torch.cuda.is_available():
    print(f"GPU {torch.cuda.get_device_name(0)} looking good")
else:
    print("GPU is not available. Will use CPU.")

You may already know this, but because this is the first post about why vLLM cannot use the GPU, this answer can stay here for folks to find.

Note: Your GPU being reported as available here is no guarantee that anything will work. Only the converse is true (for Nvidia specifically, with respect to the Python above): "No GPU available" means vLLM will use your CPU exclusively. (Unless I am wrong; I am very new at this.)

Can you provide more information such as the model and GPU you’re using, and how many prompts you are processing in total?

Is there any difference between these? I see the docs say chat is easier to use, and that it calls generate internally.

It does use the GPU; I can see some utilization, but not as high as I expected.


I use an A100 GPU, and have about 10,000 prompts to process.

After some investigation I found that if the batch size is too large, it doesn't work well. If I reduce it to a reasonable size, utilization improves, but still only to about 30%.
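For reference, splitting a long prompt list into fixed-size batches only needs a small helper. This is a generic sketch; the batch size of 512 is just an illustrative value, not a vLLM recommendation:

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 10,000 prompts split into batches of 512
prompts = [f"prompt {i}" for i in range(10_000)]
batches = list(batched(prompts, 512))
# 20 batches: 19 full batches of 512 plus a final partial batch of 272
```

Each batch can then be passed to llm.chat(...) in a loop, as in the suggestion below.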

Oh yeah, this is sort of a well-known issue. I’d suggest either of the following:

  1. Batch it yourself. This is a straightforward approach, but you’ll still see underutilization between batches, because each batch has to finish 100% before the next batch starts processing.
for batch in dataset:
    outputs += llm.chat(batch, sampling_params)
  2. Continuous batching with async (advanced):

    In this approach you implement a wrapper around AsyncLLM and use asyncio.create_task to send prompts to the engine asynchronously, letting the engine batch them. You can then use a semaphore to control the maximum number of concurrent prompts in the engine. We implemented this approach in the Ray Data LLM integration, just FYI: ray/python/ray/llm/_internal/batch/stages/vllm_engine_stage.py at 7d240117d264097ac95013524429dd78c4a4712c · ray-project/ray · GitHub
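As a rough illustration of the second approach, here's a minimal sketch of the create_task + semaphore pattern in plain asyncio. Note that fake_engine_generate is just a stand-in for a real AsyncLLM call, and MAX_IN_FLIGHT is an arbitrary illustrative value; this is a simplified assumption, not vLLM's actual API:

```python
import asyncio

MAX_IN_FLIGHT = 8  # cap on concurrent prompts in the engine (illustrative)

async def fake_engine_generate(prompt: str) -> str:
    # Stand-in for an async engine request; a real engine would
    # continuously batch whatever requests are in flight.
    await asyncio.sleep(0.001)
    return f"output for {prompt}"

async def run_all(prompts):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(prompt):
        # The semaphore limits how many prompts are in the engine at once,
        # so the engine is always fed without being flooded.
        async with sem:
            return await fake_engine_generate(prompt)

    tasks = [asyncio.create_task(bounded(p)) for p in prompts]
    # gather preserves input order in its results
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all([f"p{i}" for i in range(32)]))
```

The key point is that prompts are handed to the engine as soon as a slot frees up, rather than waiting for a whole batch to drain.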

Thank you for your suggestions; they seem reasonable. I have an additional question: it seems I can pass a very large batch with the first method (far exceeding GPU memory). How will vLLM process this? Does it also queue the requests and use continuous batching?

Yes, vLLM internally also keeps a queue of all requests and uses continuous batching to batch them, but it seems the scalability isn't that good…

Yeah, so maybe an easy solution is to use vllm serve, and then have another process send requests to this service with asyncio? (Somewhat like your second solution.)

Right. Your question sounded like you know more about vLLM than I do (as does everyone who knows anything about vLLM). I wrote my answer for search engines crawling for vllm+use+GPU.

Adding to my answer for other n00bs to come: the cuda_12.8.x_570.x.x.run runfile installer treats unticked checkboxes as "UNinstall" rather than "Do not install", despite what the label says. If you re-run the CUDA install runfile to, say, repair a CUDA toolkit you somehow messed up, and you already have the Nvidia GPU driver installed, you cannot use the runfile to update/reinstall CUDA toolkit items unless you go to runlevel 3 and reinstall the Nvidia driver too. Otherwise you end up with your CUDA toolkit reinstalled, but your GPU driver uninstalled!

Maybe I am old and an unticked "install" box changed meaning from "don't install" to "UNinstall if it's already present" and everybody knows this, but for any other old folks like me: "Don't install" seems to mean "Uninstall" to Nvidia (at least), so watch out.