Is Batch Inference for Multimodal Models Truly Batch Inference?

I’m using a multimodal model (e.g., Qwen2.5-VL) and trying to perform batch inference for image captioning. When I run the offline batch inference example code, the decode phase (LLM) is properly batched, but the prefill phase (vision encoder) seems to process each sample sequentially in a for-loop (see vllm/entrypoints/llm.py at commit cfe45320 in vllm-project/vllm on GitHub). It looks like the prefill phase takes up most of the inference time.
Was this implementation choice made because image sizes can vary between samples?
If all images within a batch have the same size, would it be safe to modify the code to perform batch inference for the prefill phase as well?

The current implementation processes the prefill phase sequentially, most likely because image sizes can vary across samples, which complicates stacking them into a single batch. If all images in a batch share the same size, modifying the code to batch the prefill phase could be feasible. However, make sure the model and the preprocessing pipeline can handle batched image inputs without issues.
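
For reference, here is a minimal sketch of the kind of batched offline call this applies to, assuming vLLM's offline multimodal input format; the model name, prompt template, and image paths are illustrative rather than taken from the original example:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative model and prompt template -- adjust to your actual setup.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Same-size images, as discussed above (paths are placeholders).
images = [Image.open(path) for path in ["cat.jpg", "dog.jpg", "bird.jpg"]]

# Submitting all requests in a single generate() call lets the engine see the
# whole batch at once; each item pairs the text prompt with one image.
batched_inputs = [
    {"prompt": prompt, "multi_modal_data": {"image": image}} for image in images
]

outputs = llm.generate(batched_inputs, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Note that this sketch only controls how the requests are submitted; whether the vision encoder then processes the images one by one or as a stacked batch is decided inside the engine.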

Would you like more detail on how to modify the code for batch processing?

self._add_request just schedules the request to be run. The scheduler can choose to process more than one request at once.

Yes, self._add_request only schedules a request to be processed; the scheduler can handle multiple requests concurrently. The scheduling logic is designed to optimize resource utilization and throughput by batching multiple queued requests into each engine step, depending on the available resources and the scheduling policy in use.
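
As a rough illustration of that split between adding requests and scheduling them, here is a minimal sketch against the public LLMEngine interface (a small text-only model is used just to keep the example short, and exact add_request signatures can differ between vLLM versions):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

# LLM.generate wraps this same machinery: add all requests, then run the engine.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
params = SamplingParams(temperature=0.0, max_tokens=32)

# Enqueue several requests up front. Nothing is executed yet -- this mirrors
# what the per-request loop over self._add_request does.
prompts = ["Describe a cat.", "Describe a dog.", "Describe a bird."]
for i, prompt in enumerate(prompts):
    engine.add_request(str(i), prompt, params)

# Each step() asks the scheduler to pick a batch from the queued requests,
# so several requests can be prefilled or decoded together in one step.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```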

Would you like more detail on how the scheduler manages multiple requests?
