Could you expand a bit more on “inference speed” here?
Image + text inference should generally be slower than text-only inference in time-to-first-token (TTFT): the extra image preprocessing and encoding happens during the prefill phase of the request, before the first token can be emitted. Decoding performance, however, should be identical, since per-token decode cost does not depend on whether the prompt contained an image.
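As a rough sketch of why the gap shows up only in TTFT, here is a toy latency model. All constants below are hypothetical placeholders for illustration, not measured numbers from any real model:

```python
# Toy latency model: image inputs add a vision-encoder pass and extra
# prompt tokens to prefill (raising TTFT), while per-token decode cost
# is the same either way. All numbers are made up for illustration.

PREFILL_MS_PER_TOKEN = 0.2   # hypothetical prefill cost per prompt token
DECODE_MS_PER_TOKEN = 15.0   # hypothetical per-token decode latency
IMAGE_ENCODE_MS = 80.0       # hypothetical vision-encoder cost per image
IMAGE_TOKENS = 576           # hypothetical tokens produced per encoded image

def ttft_ms(text_tokens: int, num_images: int = 0) -> float:
    """Time-to-first-token: image encoding plus prefill over all prompt tokens."""
    prompt_tokens = text_tokens + num_images * IMAGE_TOKENS
    return num_images * IMAGE_ENCODE_MS + prompt_tokens * PREFILL_MS_PER_TOKEN

def decode_ms(output_tokens: int) -> float:
    """Decode time does not depend on whether the prompt contained images."""
    return output_tokens * DECODE_MS_PER_TOKEN

print(f"TTFT text-only:  {ttft_ms(200):.1f} ms")        # prefill only
print(f"TTFT with image: {ttft_ms(200, 1):.1f} ms")     # encoder + longer prefill
print(f"Decode 100 tokens (either case): {decode_ms(100):.1f} ms")
```

In this model the image request pays both the encoder cost and the prefill cost of the extra image tokens up front, but the decode loop afterward is unaffected.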