Why can't vLLM fully use the GPU in batch processing?

I’m running vLLM to process a large dataset for inference.

I have tried several parameters, but they don't help in my case. GPU utilization is very low (about 10%, with nearly no increase in GPU power draw).

self.llm = LLM(
    model=self.model_name,
    trust_remote_code=True,
    # max_model_len=4096,
    # max_num_seqs=512*10,
    # max_num_batched_tokens=512*10,
    # enable_chunked_prefill=True,
    # enable_prefix_caching=True,
)
self.llm.chat(prompts, sampling_params)

prompts can be a very large list (e.g., anywhere from 1,000 to 100,000 prompts)

Thank you for your help!

Maybe you could try:
self.llm.generate(prompts, sampling_params)

If your GPU is Nvidia, use this Python snippet to check whether it's available for vLLM to use.

import torch
if torch.cuda.is_available():
    print(f"GPU {torch.cuda.get_device_name(0)} looking good")
else:
    print("GPU is not available. Will use CPU.")

You may already know this, but because this is the first post about why vLLM cannot use the GPU, this answer can stay here for folks to find.

Note: Your GPU being reported as available here is no guarantee that anything will work. Only the converse is true (for Nvidia specifically, with respect to the Python above): "No GPU available" means vLLM will use your CPU exclusively. (Unless I am wrong; I am very new at this.)

Can you provide more information such as the model and GPU you’re using, and how many prompts you are processing in total?

Is there any difference between these? I see the docs say chat is easier to use, and that it calls generate internally.

It does use the GPU; I can see some utilization, but not as high as I expected.


I use an A100 GPU, and have about 10,000 prompts to process.

After some investigation I found that if the batch size is too large, it doesn't work well. If I reduce it to a reasonable size, utilization improves, but still only to about 30%.
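For reference, splitting a long prompt list into fixed-size batches only needs a small helper. This is a generic sketch; the batch size of 512 is just an illustrative value, not a vLLM recommendation:

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 10,000 prompts split into batches of 512
prompts = [f"prompt {i}" for i in range(10_000)]
batches = list(batched(prompts, 512))
# 20 batches: 19 full batches of 512 plus a final partial batch of 272
```

Each batch can then be passed to llm.chat(...) in a loop, as in the suggestion below.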

Oh yeah, this is sort of a well-known issue. I’d suggest either of the following:

  1. Batch it yourself. This is a straightforward approach, but you’ll still see underutilization between batches, because each batch has to finish 100% before the next batch starts processing.
for batch in dataset:
    outputs += llm.chat(batch, sampling_params)
  2. Continuous batching with async (advanced):

    In this approach you implement a wrapper around AsyncLLM and use asyncio.create_task to send prompts to the engine asynchronously, letting the engine batch them. You can then use a semaphore to control the maximum number of concurrent prompts in the engine. We implemented this approach in the Ray Data LLM integration, just FYI: ray/python/ray/llm/_internal/batch/stages/vllm_engine_stage.py at 7d240117d264097ac95013524429dd78c4a4712c · ray-project/ray · GitHub
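As a rough illustration of the second approach, here's a minimal sketch of the create_task + semaphore pattern in plain asyncio. Note that fake_engine_generate is just a stand-in for a real AsyncLLM call, and MAX_IN_FLIGHT is an arbitrary illustrative value; this is a simplified assumption, not vLLM's actual API:

```python
import asyncio

MAX_IN_FLIGHT = 8  # cap on concurrent prompts in the engine (illustrative)

async def fake_engine_generate(prompt: str) -> str:
    # Stand-in for an async engine request; a real engine would
    # continuously batch whatever requests are in flight.
    await asyncio.sleep(0.001)
    return f"output for {prompt}"

async def run_all(prompts):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(prompt):
        # The semaphore limits how many prompts are in the engine at once,
        # so the engine is always fed without being flooded.
        async with sem:
            return await fake_engine_generate(prompt)

    tasks = [asyncio.create_task(bounded(p)) for p in prompts]
    # gather preserves input order in its results
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all([f"p{i}" for i in range(32)]))
```

The key point is that prompts are handed to the engine as soon as a slot frees up, rather than waiting for a whole batch to drain.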

Thank you for your suggestions; they seem reasonable. I have an additional question: it seems I can pass a very large batch with the first method (far exceeding GPU memory). How will vLLM process this? Does it also queue the requests and use continuous batching?

Yes, vLLM internally also keeps a queue of all requests and uses continuous batching to batch them, but it seems the scalability isn't that good…

Yeah, so maybe an easy solution is to use vllm serve, and then have another process send requests to this service with asyncio? (Somewhat like your second solution.)

Right. Your question sounded like you know more about vLLM than I do (as does everyone who knows anything about vLLM). I wrote my answer for search engines crawling for vllm+use+GPU.

Adding to my answer for other n00bs to come: the cuda_12.8.x_570.x.x.run runfile installer treats unticked checkboxes as "UNinstall" rather than "Do not install", despite what the label says. If you re-run the CUDA install runfile to, say, repair a CUDA toolkit you somehow messed up, and you already have the Nvidia GPU driver installed, you cannot use the runfile to update/reinstall CUDA toolkit items unless you go to runlevel 3 and reinstall the Nvidia driver too. Otherwise you end up with your CUDA toolkit reinstalled, but your GPU driver uninstalled!

Maybe I am old and an unticked "install" box changed meaning from "don't install" to "UNinstall if it's already present" and everybody knows this, but for any other old folks like me: "Don't install" seems to mean "Uninstall" to Nvidia (at least), so watch out.