INFO 07-02 07:06:12 [worker.py:294] Memory profiling takes 8.16 seconds
INFO 07-02 07:06:12 [worker.py:294] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.95) = 22.46GiB
INFO 07-02 07:06:12 [worker.py:294] model weights take 15.95GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 0.69GiB; the rest of the memory reserved for KV Cache is 5.76GiB.
INFO 07-02 07:06:13 [executor_base.py:113] # cuda blocks: 2948, # CPU blocks: 2048
INFO 07-02 07:06:13 [executor_base.py:118] Maximum concurrency for 4096 tokens per request: 11.52x
INFO 07-02 07:06:14 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 10.39 seconds
Adding requests: 0%| | 0/1 [00:04<?, ?it/s]
Failed to evaluate openbmb/MiniCPM-Llama3-V-2_5: Expected there to be 1 prompt updates corresponding to 1 image items, but instead found 0 prompt updates! This is likely because you forgot to include input placeholder tokens (e.g., <image>, <|image_pad|>) in the prompt. If the model has a chat template, make sure you have applied it before calling LLM.generate.
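A quick way to confirm what the error is asking for (a sketch only, reusing the MiniCPM placeholder format that already appears in _prompt_minicpm below; the question string is just a placeholder) is to render the chat template by hand and check that the image placeholder survives in the final prompt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)
messages = [{"role": "user", "content": "(<image>./</image>)\nDescribe the image."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # the rendered prompt should still contain the image placeholder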
“”"Utilities for running inference with various VLM models.
This module exposes the :class:VLMInference
class which can be used with
different multimodal model families. The class automatically selects the
correct prompting scheme based on the provided model name or an explicit
family
argument.
“”"
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# os.environ["VLLM_USE_V1"] = "1"  # Ensure vLLM uses the intended engine version: set to "0" for BLIP, Gemma, InternVL, Ovis

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams


class VLMInference:
    """Run inference for different multimodal model families."""
def __init__(self, model_name: str, family: str | None = None) -> None:
self.model_name = model_name
# Infer the family if not explicitly provided
self.family = family or self._detect_family(model_name)
# Set up the LLM and tokenizer once
        self.llm = LLM(
            model=model_name,
            gpu_memory_utilization=0.95,
            enforce_eager=True,
            # max_model_len=1024 * 4,
            max_num_seqs=1,
            limit_mm_per_prompt={"image": 1},  # disable for Gemma
            trust_remote_code=True,
            tensor_parallel_size=1,  # set to 2 for Prometheus
            # dtype="float16",
        )
self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Conservative generation parameters for deterministic output
self.sampling_params = SamplingParams(
temperature=0.0,
max_tokens=500,
top_p=1.0,
top_k=1,
)
# ------------------------------------------------------------------
# Prompt helpers
# ------------------------------------------------------------------
def _prompt_internvl(self, question: str) -> str:
"""Prompt format for InternVL models."""
messages = [{"role": "user", "content": f"<image>\n{question}"}]
return self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
def _prompt_gemma(self, question: str) -> str:
"""Prompt format for Gemma models."""
return (
"<bos><start_of_turn>user\n"
f"<start_of_image>{question}<end_of_turn>\n"
"<start_of_turn>model\n"
)
    def _prompt_minicpm(self, question: str) -> str:
        """Prompt format for MiniCPM-V models."""
messages = [{
"role": "user",
"content": f"(<image>./</image>)\n{question}"
}]
return self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
def _prompt_blip(self, question: str) -> str:
"""Prompt format for BLIP/BLIP2 models."""
return f"Question: {question} Answer:"
def _prompt_ovis(self, question: str) -> str:
"""Prompt format for Ovis models."""
return f"USER: <image>\n{question}\nASSISTANT:"
def _prompt_llama(self, question: str) -> str:
"""Prompt format for Llama-based models (LLaVA, etc.)."""
return f"USER: <image>\n{question}\nASSISTANT:"
def _prompt_llava(self, question: str) -> str:
"""Prompt format for Llava models."""
return f"USER: <image>\n{question}\nASSISTANT:"
_PROMPT_FUNCS = {
"internvl": _prompt_internvl,
"gemma": _prompt_gemma,
"minicpm": _prompt_minicpm,
"blip2": _prompt_blip,
"ovis": _prompt_ovis,
"llama": _prompt_llama,
"llava": _prompt_llava,
}
    def _detect_family(self, model_name: str) -> str:
        """Best-effort detection of the model family from its name."""
        name = model_name.lower()
        # Check in a fixed order so that, e.g., "llama" still wins over "llava"
        # when both substrings appear in the model name.
        for family in ("internvl", "llama", "llava", "gemma", "minicpm", "blip2", "ovis"):
            if family in name:
                print(f"Detected family: {family} for model {model_name}")
                return family
        # Fall back to InternVL-style prompting when the family cannot be detected.
        return "internvl"
def build_prompt(self, question: str) -> str:
"""Return the correct prompt for question."""
func = self._PROMPT_FUNCS.get(self.family)
if func is None:
raise ValueError(f"No prompt function found for model family: {self.family}")
return func(self, question)
def predict(self, image, question: str) -> str:
"""Generate an answer for the given image and question."""
prompt = self.build_prompt(question)
        outputs = self.llm.generate(
            {"prompt": prompt, "multi_modal_data": {"image": image}},
            sampling_params=self.sampling_params,
        )
return outputs[0].outputs[0].text.strip()
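

if __name__ == "__main__":
    # Minimal usage sketch, not part of the original module: the model name,
    # image path, and question are placeholders, and Pillow is assumed to be
    # available for loading the image.
    from PIL import Image

    runner = VLMInference("OpenGVLab/InternVL2-8B")
    image = Image.open("example.jpg").convert("RGB")
    print(runner.predict(image, "Describe the image in one sentence."))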