Is Batch Inference for Multimodal Models Truly Batch Inference?

I’m using a multimodal model (e.g., Qwen2.5-VL) and trying to perform batch inference for image captioning. When I run the offline batch inference example code, the decode phase (LLM) is properly batched, but the prefill phase (vision encoder) seems to process each sample sequentially in a for-loop (see vllm/entrypoints/llm.py at commit cfe45320 in vllm-project/vllm on GitHub). It looks like the prefill phase takes up most of the inference time.
Was this implementation choice made because image sizes can vary between samples?
If all images within a batch have the same size, would it be safe to modify the code to perform batch inference for the prefill phase as well?

The current implementation processes the prefill phase sequentially, most likely because image sizes can vary across samples, which complicates stacking them into a single batch. If all images in a batch share the same size, modifying the code to batch the prefill phase could be feasible. However, make sure the model and the preprocessing pipeline can handle batched image inputs without issues.
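
For reference, here is a minimal sketch of the kind of batched offline call this applies to, assuming vLLM's offline multimodal input format; the model name, prompt template, and image paths are illustrative rather than taken from the original example:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative model and prompt template -- adjust to your actual setup.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Same-size images, as discussed above (paths are placeholders).
images = [Image.open(path) for path in ["cat.jpg", "dog.jpg", "bird.jpg"]]

# Submitting all requests in a single generate() call lets the engine see the
# whole batch at once; each item pairs the text prompt with one image.
batched_inputs = [
    {"prompt": prompt, "multi_modal_data": {"image": image}} for image in images
]

outputs = llm.generate(batched_inputs, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Note that this sketch only controls how the requests are submitted; whether the vision encoder then processes the images one by one or as a stacked batch is decided inside the engine.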

Would you like more detail on how to modify the code for batch processing?

self._add_request just schedules the request to be run. The scheduler can choose to process more than one request at once.

Yes, self._add_request only schedules a request to be processed; the scheduler can handle multiple requests concurrently. The scheduling logic is designed to optimize resource utilization and throughput by batching multiple queued requests into each engine step, depending on the available resources and the scheduling policy in use.
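
As a rough illustration of that split between adding requests and scheduling them, here is a minimal sketch against the public LLMEngine interface (a small text-only model is used just to keep the example short, and exact add_request signatures can differ between vLLM versions):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

# LLM.generate wraps this same machinery: add all requests, then run the engine.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
params = SamplingParams(temperature=0.0, max_tokens=32)

# Enqueue several requests up front. Nothing is executed yet -- this mirrors
# what the per-request loop over self._add_request does.
prompts = ["Describe a cat.", "Describe a dog.", "Describe a bird."]
for i, prompt in enumerate(prompts):
    engine.add_request(str(i), prompt, params)

# Each step() asks the scheduler to pick a batch from the queued requests,
# so several requests can be prefilled or decoded together in one step.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```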

Would you like more detail on how the scheduler manages multiple requests?
