Help with image input from a vLLM point of view

I am trying to use PIL instead of base64 for local image processing. With base64, the input token count is high, so I am considering passing a PIL image directly. When I checked the documentation, I updated my code to the following:

def _process_image_with_vllm(self, image_path: str, image_prompt: str) -> str:
    """Process image using VLLM vision model with ImageAsset"""
    try:
        if not self.vllm_vision_client:
            return f"VLLM vision client not available for {image_path}"

        vllm_config = self.config_loader.get("vllm")
        vision_config = vllm_config.get("vision", {}) if vllm_config else {}
        
        # Load image using VLLM's ImageAsset instead of base64 conversion
        try:
            image_asset = ImageAsset(image_path)
            pil_image = image_asset.pil_image
        except Exception as e:
            return f"Failed to load image using ImageAsset: {image_path}, Error: {str(e)}"
        
        # Create messages with PIL image content instead of base64
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": image_prompt},
                    {"type": "image_pil", "image_pil": pil_image}
                ]
            }
        ]
        
        response = self.vllm_vision_client.chat(
            messages=messages,
            args={
                "temperature": vision_config.get("temperature", 0.0),
                "seed": vision_config.get("seed", 1234)
            }
        )
        
        if response:
            self.vllm_vision_usage_tracker.update_usage(response)
            self.vllm_overall_usage_tracker.update_usage(response)
            return response.get("response", f"VLLM vision processing failed for {image_path}")
        else:
            return f"VLLM vision processing failed for {image_path}"
            
    except Exception as e:
        print(f"Error calling VLLM vision model for {image_path}: {e}")
        return f"VLLM vision model processing failed for {image_path}: {str(e)}"

So what is the right approach for local image processing, excluding base64 conversion?

You are seeing the “404 Client Error” because ImageAsset in vLLM is not designed to load arbitrary local images; it only loads predefined demo/test images bundled with vLLM, not user-supplied file paths. That’s why it tries to fetch your image from a public S3 bucket and fails (vllm.assets.image.ImageAsset docs, image.py source).

For local image processing, you should load your image with PIL (e.g., PIL.Image.open(image_path)) and pass the PIL image object directly to the vLLM API as shown in the multimodal input documentation. However, you cannot send a PIL image object over HTTP (e.g., to the OpenAI-compatible API server), as it is not JSON serializable. For local/offline inference (Python API), pass the PIL image in the multi_modal_data dictionary. For API calls, you must use base64 or a URL (multimodal serving docs). There is no supported way to send a raw PIL image over HTTP.
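
For the HTTP path, the base64 step needs nothing beyond the standard library if the file is already in a format the model accepts (e.g., PNG or JPEG) — a minimal sketch, with the helper name made up for illustration:

```python
import base64
import pathlib

def to_data_url(image_path: str, mime: str = "image/png") -> str:
    # Read the raw file bytes and wrap them in a data: URL, which the
    # OpenAI-compatible server accepts inside an "image_url" content part.
    raw = pathlib.Path(image_path).read_bytes()
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")
```

The returned string goes straight into {"type": "image_url", "image_url": {"url": ...}}; no PIL re-encode is needed when the file is already in the target format.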

Would you like a step-by-step code example for local inference with PIL images using the vLLM Python API?

But according to the documentation, only the .generate method has that option; it's not available for the .chat method. What is the difference between the two? How much does the outcome change?

It seems likely that both .generate and .chat methods in vLLM support multimodal (e.g., PIL image) input, but the way you provide input differs. For .generate, you pass a dictionary with “prompt” and “multi_modal_data” (where “image” is a PIL image), as shown in the multimodal inputs documentation. For .chat, you provide a list of messages (OpenAI-style), and images can be included as PIL objects in the message content, e.g., {"type": "image_pil", "image_pil": pil_image} (same doc, see “If using the LLM.chat method…”).

The main difference is that .chat automatically applies the model’s chat template and is designed for multi-turn, conversational input, while .generate expects a pre-formatted prompt and is more “raw.” If you use the same model, prompt, and parameters, the outputs should be very similar, but .chat is recommended for chat/instruction-tuned models to ensure correct formatting (generative models doc). Would you like a code example for .chat with PIL images?
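
Schematically, the two input shapes look like this (field names are the ones quoted from the multimodal docs above; the LLaVA-style prompt string is only an illustration, and the engine calls are commented out since they need a loaded model):

```python
pil_image = object()  # stand-in for PIL.Image.open(path)

# .generate: a pre-formatted prompt plus a multi_modal_data dict.
generate_input = {
    "prompt": "USER: <image>\nWhat is in this image?\nASSISTANT:",
    "multi_modal_data": {"image": pil_image},
}
# outputs = llm.generate(generate_input)

# .chat: OpenAI-style messages; the model's chat template is applied for you.
chat_input = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_pil", "image_pil": pil_image},
    ],
}]
# outputs = llm.chat(chat_input)
```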

Yes, give me an example with .chat for PIL images from a local folder.

Here is an example of using vLLM’s .chat method with a local image loaded via PIL, as described in the multimodal inputs documentation:

from vllm import LLM
from PIL import Image

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("/path/to/your/local/image.png")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_pil", "image_pil": image}
        ]
    }
]

outputs = llm.chat(messages)
for o in outputs:
    print(o.outputs[0].text)

This approach works for local inference (Python API), not for HTTP API calls. The key is using {"type": "image_pil", "image_pil": image} in the message content.
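
Since you asked about a local folder: a small helper (hypothetical, not part of vLLM) can build one message list per image; the `load` callable is injectable so the folder-walking logic is shown without hard-wiring PIL:

```python
from pathlib import Path

def build_image_messages(folder, prompt, load=lambda p: p):
    """Yield a .chat-style message list for each .png file in `folder`.

    Pass load=PIL.Image.open to get real PIL objects; the default just
    passes the path through so the structure is easy to inspect.
    """
    for path in sorted(Path(folder).glob("*.png")):
        yield [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_pil", "image_pil": load(path)},
            ],
        }]
```

With vLLM this would be driven as: for messages in build_image_messages("imgs", "Describe this image.", load=Image.open): outputs = llm.chat(messages).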

Can you give me an example output of that? In what format would the result of using a PIL image be, for extracting the response, tokens, latency, etc.?

When you use the .chat method with a PIL image as shown above, the return value is a list of RequestOutput objects. Each one contains the generated text and token information; timing details, when the engine records them, live on a metrics attribute rather than a top-level latency field. For example, after running:

outputs = llm.chat(messages)
for o in outputs:
    print("Response:", o.outputs[0].text)
    print("Token IDs:", o.outputs[0].token_ids)
    print("Token Count:", len(o.outputs[0].token_ids))
    if o.metrics:  # timing info, if the engine recorded it
        print("Latency (s):", o.metrics.finished_time - o.metrics.arrival_time)

The output will look like:

Response: The image shows a cat sitting on a windowsill.
Token IDs: [1, 234, 567, ...]
Token Count: 12
Latency (s): 0.432

The actual fields may vary by vLLM version and model, but this is the typical structure (multimodal inputs example).
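
One defensive way to pull those fields out without depending on one specific layout (attribute names here follow the RequestOutput/metrics structure described above, but treat them as assumptions that should be checked against your vLLM version):

```python
def extract_result(o):
    # o is a vLLM RequestOutput-like object: generated text and token IDs
    # live on o.outputs[0]; timing, when recorded, lives on o.metrics.
    completion = o.outputs[0]
    metrics = getattr(o, "metrics", None)
    latency = None
    if metrics is not None and metrics.finished_time and metrics.arrival_time:
        latency = metrics.finished_time - metrics.arrival_time
    return {
        "text": completion.text,
        "token_count": len(completion.token_ids),
        "latency_s": latency,
    }
```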

Would you like a more detailed breakdown of the output object structure?

def _process_image_with_vllm(self, image_path: str, image_prompt: str) -> str:
    """Process image using VLLM vision model with base64 encoding"""
    try:
        if not self.vllm_vision_client:
            return f"VLLM vision client not available for {image_path}"

        vllm_config = self.config_loader.get("vllm")
        vision_config = vllm_config.get("vision", {}) if vllm_config else {}

        # Convert image to base64 for VLLM processing
        base64_image = self._image_to_base64(image_path)
        if not base64_image:
            return f"Failed to convert image to base64: {image_path}"

        # Create messages with base64 image content
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": image_prompt},
                    {"type": "image_url", "image_url": {"url": base64_image}}
                ]
            }
        ]

        response = self.vllm_vision_client.chat(
            messages=messages,
            args={
                "temperature": vision_config.get("temperature", 0.0),
                "seed": vision_config.get("seed", 1234)
            }
        )

        if response:
            self.vllm_vision_usage_tracker.update_usage(response)
            self.vllm_overall_usage_tracker.update_usage(response)
            return response.get("response", f"VLLM vision processing failed for {image_path}")
        else:
            return f"VLLM vision processing failed for {image_path}"

    except Exception as e:
        print(f"Error calling VLLM vision model for {image_path}: {e}")
        return f"VLLM vision model processing failed for {image_path}: {str(e)}"

Can you convert this function to take PIL images from local files while maintaining the original functionality of the other parts?

Certainly! To use a local PIL image with vLLM’s .chat method (for local Python inference), you should load the image with PIL and insert it into the message as {"type": "image_pil", "image_pil": pil_image}. The rest of your function can remain unchanged. Here’s the adapted function:

def _process_image_with_vllm(self, image_path: str, image_prompt: str) -> str:
    """Process image using VLLM vision model with PIL image input."""
    try:
        if not self.vllm_vision_client:
            return f"VLLM vision client not available for {image_path}"
        vllm_config = self.config_loader.get("vllm")
        vision_config = vllm_config.get("vision", {}) if vllm_config else {}

        # Load image as PIL
        from PIL import Image
        try:
            pil_image = Image.open(image_path).convert("RGB")
        except Exception as e:
            return f"Failed to load image as PIL: {image_path}, Error: {str(e)}"

        # Create messages with PIL image content
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": image_prompt},
                    {"type": "image_pil", "image_pil": pil_image}
                ]
            }
        ]

        response = self.vllm_vision_client.chat(
            messages=messages,
            args={
                "temperature": vision_config.get("temperature", 0.0),
                "seed": vision_config.get("seed", 1234)
            }
        )

        if response:
            self.vllm_vision_usage_tracker.update_usage(response)
            self.vllm_overall_usage_tracker.update_usage(response)
            return response.get("response", f"VLLM vision processing failed for {image_path}")
        else:
            return f"VLLM vision processing failed for {image_path}"

    except Exception as e:
        print(f"Error calling VLLM vision model for {image_path}: {e}")
        return f"VLLM vision model processing failed for {image_path}: {str(e)}"

This approach is supported for local Python inference, as described in the multimodal inputs documentation.

But you said: "When you use the .chat method with a PIL image as shown above, the output is a list of response objects. Each object contains the generated text, token information, and timing/latency details."

Then why is the code like this?

Great question! The code I provided is for typical vLLM Python API usage, where calling .chat(messages) returns a list of response objects (e.g., outputs). Each object in this list (e.g., o in for o in outputs:) exposes fields like outputs[0].text and outputs[0].token_ids, plus timing metrics (see example).

However, in your code, you are calling self.vllm_vision_client.chat(...) and then immediately doing response.get("response", ...). This suggests your vllm_vision_client is a wrapper or custom client that may not return the raw vLLM response objects, but instead a dictionary (possibly with just the text). If you want access to the full vLLM response object (with tokens, latency, etc.), you should use the standard vLLM Python API and inspect the returned objects directly, as in:

outputs = llm.chat(messages)
for o in outputs:
    print(o.outputs[0].text)
    print(o.outputs[0].token_ids)
    print(o.metrics)  # timing details (arrival/finish times), when recorded

If your client wraps or post-processes the output, you may lose access to these details. Would you like help adapting your client to expose the full vLLM response object?
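
As a sketch of that adaptation (the wrapper name and return shape here are made up for illustration), the client could hand back both the convenient text field and the untouched response object:

```python
def chat_with_details(llm, messages, **sampling_kwargs):
    # Hypothetical wrapper: run .chat but return both the plain text and
    # the raw first response object, so token/timing data stays reachable.
    outputs = llm.chat(messages, **sampling_kwargs)
    first = outputs[0]
    return {"response": first.outputs[0].text, "raw": first}
```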

True, what you said is right; the problem is solved for me.

Your current chat method sends requests to a vLLM OpenAI-compatible HTTP API server. The HTTP API does not support sending raw PIL images—images must be sent as base64-encoded strings or URLs (serving docs). PIL images are only supported in the local Python API, not over HTTP.

To use local images with your HTTP API client:

  1. Load the image with PIL.
  2. Encode it as base64.
  3. Send it in the payload as {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<...>"}}.

Example:

import base64
from PIL import Image
import io

def pil_image_to_base64(image_path):
    with Image.open(image_path) as img:
        buffered = io.BytesIO()
        img.convert("RGB").save(buffered, format="JPEG")
        return "data:image/jpeg;base64," + base64.b64encode(buffered.getvalue()).decode()

# Usage in your chat client:
image_path = "/path/to/local/image.jpg"
base64_image = pil_image_to_base64(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": base64_image}}
        ]
    }
]
result = your_client.chat(messages)
print(result)

Summary:

  • For HTTP API, always use base64 or URL for images.
  • PIL objects are only for local (Python) API, not HTTP.

Would you like a full example for the local (Python) API as well?

Can you give me sources regarding this, i.e., working links?

Certainly! Here are the official vLLM documentation sources that explain image input requirements for both the Python API and HTTP API:

  • For the Python API (local inference), you can pass a PIL image object directly in the multi_modal_data dictionary or as image_pil in .chat messages:
    Multimodal Inputs (Python API)

  • For the HTTP/OpenAI-compatible API server, you must use a base64-encoded string or a URL for images—PIL objects are not supported over HTTP:
    Multimodal Inputs (Serving/API)

These links provide code examples and clarify the supported formats for each interface.
