Why is inference for Qwen 2.5 VL so slow when we send an image?

Could you expand a bit more about “inference speed” here?

Image + text inference is generally slower than text-only inference in time-to-first-token, because the image has to be processed and encoded during prefill on top of the text prompt. Per-token decoding speed, however, should be identical once the first token is out.
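One way to check this on your own setup is to time time-to-first-token and per-token decode latency separately from a streaming generate call. The sketch below is illustrative only: `fake_generate_stream` is a stand-in that simulates prefill and decode costs with sleeps, not the Qwen API. With a real model you would apply the same `measure` pattern to the tokens yielded by a streamer.

```python
import time
from typing import Iterator, Tuple

def fake_generate_stream(n_tokens: int, prefill_s: float, decode_s: float) -> Iterator[str]:
    """Stand-in for a streaming generate call: sleeps to mimic the
    prefill cost before the first token, then a fixed per-token decode cost."""
    time.sleep(prefill_s)          # image encoding + prompt prefill happens here
    for i in range(n_tokens):
        time.sleep(decode_s)       # one decode step per token
        yield f"tok{i}"

def measure(stream: Iterator[str]) -> Tuple[float, float]:
    """Return (time_to_first_token, mean_per_token_decode_time)."""
    start = time.perf_counter()
    ttft = None
    prev = start
    gaps = []
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start     # TTFT absorbs the whole prefill cost
        else:
            gaps.append(now - prev)
        prev = now
    return ttft, sum(gaps) / len(gaps)

# Text-only vs image+text differ only in the simulated prefill cost;
# the per-token decode cost is the same in both runs.
ttft_text, decode_text = measure(fake_generate_stream(20, prefill_s=0.02, decode_s=0.005))
ttft_img, decode_img = measure(fake_generate_stream(20, prefill_s=0.10, decode_s=0.005))
print(f"text-only : TTFT={ttft_text:.3f}s decode/token={decode_text:.4f}s")
print(f"image+text: TTFT={ttft_img:.3f}s decode/token={decode_img:.4f}s")
```

If the claim holds for your deployment, the TTFT gap should roughly match the image encoding time while the decode/token numbers stay close to each other.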