Multimodal inference guideline?

Hi all.

First-time user, so I have some questions about multimodal inference. Below you can find the code I currently use to run it.

  • As my card has 12 GB of VRAM, shouldn't I be able to run e.g. a 7B model? I often get an error that there is not enough memory for the KV cache. Is there a guideline on how much VRAM each model requires? Are there tricks to lower this amount?
  • As I understand it, the prompt template (when inputting an image) is rather important, and it changes per model. Is there an overview of which model uses which template?
  • I often get output that is absolute nonsense. Sometimes I just get random Chinese text, text that doesn't pertain to the image, text that stops mid-sentence, … No clue what I'm doing wrong here.

Thanks for any help :slight_smile: !

from vllm import LLM
import PIL.Image
import multiprocessing as mp

mp.set_start_method("spawn", force=True)

def main():
    # Initialize the model
    llm = LLM(
        model="OpenGVLab/InternVL2-1B",
        gpu_memory_utilization=0.9,
        enforce_eager=True,
    )

    # Prepare the multimodal prompt with the correct placeholder
    prompt = "What is the content of this image in English?\n<image>"

    # Load image
    image = PIL.Image.open("./image.png")

    # Generate
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    # Print results
    for o in outputs:
        print(o.outputs[0].text)

if __name__ == "__main__":
    main()

You can run 7B models on a 12 GB GPU, but only with careful memory management: reduce max_model_len and the batch size (max_num_seqs), and keep gpu_memory_utilization below 1.0. Also, disabling CUDA graphs with enforce_eager=True lowers memory usage, but may reduce performance. See conserving memory and context length and batch size for details.
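
As a rough sketch of how those knobs fit together (the model name and parameter values below are illustrative rather than a recommendation; an unquantized 7B model in bf16 needs roughly 14 GB for the weights alone, so a quantized checkpoint is assumed):

from vllm import LLM

# Illustrative settings for a 12 GB card; substitute the model you actually use.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct-AWQ",  # example of a quantized 7B-class VLM
    max_model_len=2048,             # shorter context -> smaller KV cache
    max_num_seqs=1,                 # fewer concurrent sequences -> smaller KV cache
    gpu_memory_utilization=0.90,    # fraction of VRAM vLLM is allowed to claim
    enforce_eager=True,             # skip CUDA graph capture to save memory (slower)
)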

Prompt templates are model-specific and must match the format expected by each model (e.g., <image>, <|image_1|>, etc.). Refer to the HuggingFace model card or vLLM’s vision-language example for the correct template. Incorrect templates or tokenization can cause nonsense or incomplete outputs. For InternVL2, use the chat template as shown in the official example.
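
One way to avoid hand-writing templates is to let the tokenizer build the prompt from the model's own chat template. A minimal sketch for InternVL2 (the "<image>" placeholder is the one InternVL expects; other families use different placeholders, so check the model card):

from transformers import AutoTokenizer

# Build the prompt from the model's bundled chat template instead of hard-coding it.
tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL2-1B", trust_remote_code=True)
messages = [{"role": "user", "content": "<image>\nWhat is the content of this image in English?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # inspect the exact string the model expects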

Would you like more detail or a code example for a specific model?

The links for “official example” and “vision-language example” are not working. Can you point me to the correct examples? Also, give an overview of all the “tricks” for lowering memory usage.
And is the code fully correct for multimodal image inference? Can you give example code?

The correct example for multimodal image inference is vision_language.py and for multi-image input, see vision_language_multi_image.py. For InternVL2, use the chat template as shown in these examples and the multimodal inputs documentation.

To lower memory usage:

  • Reduce max_model_len and max_num_seqs (docs)
  • Set gpu_memory_utilization below 1.0 (docs)
  • Use quantized models if available
  • Use tensor parallelism if you have multiple GPUs
  • Limit the number of images per prompt with limit_mm_per_prompt
  • Adjust multi-modal processor arguments (e.g., mm_processor_kwargs; see the sketch after the example below)
  • Disable CUDA graphs with enforce_eager=True (docs)

Here is a correct example for single-image inference with InternVL2:

from vllm import LLM
from PIL import Image

llm = LLM(
    model="OpenGVLab/InternVL2-1B",
    gpu_memory_utilization=0.9,        # fraction of VRAM vLLM may claim
    enforce_eager=True,                # disable CUDA graphs to save memory
    limit_mm_per_prompt={"image": 1},  # at most one image per prompt
)

# InternVL2 chat-style prompt with the <image> placeholder in the user turn
prompt = "<|im_start|>User\n<image>\nWhat is the content of this image in English?\n<|im_end|>\n<|im_start|>Assistant\n"
image = Image.open("./image.png")

# Pass the prompt and the image together as one multimodal request
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    print(o.outputs[0].text)
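
If memory is still tight, the mm_processor_kwargs bullet above can be applied to shrink the number of image tokens. A sketch assuming InternVL's processor accepts max_dynamic_patch (other models expose different knobs, e.g. num_crops or min/max_pixels):

from vllm import LLM

llm = LLM(
    model="OpenGVLab/InternVL2-1B",
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={"max_dynamic_patch": 1},  # fewer image tiles -> fewer image tokens
)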

Would you like more detail on any of these points?

Give me the correct prompt templates (in code) for all the following model families:

  • InternVL
  • Qwen
  • DeepSeek VL
  • Gemma
  • MiniCPM
  • Llama
  • LLaVA

One other problem I have is that the Gemma models seem to be locked? I can't use them without agreeing to certain things.

Based on official vLLM and HuggingFace examples, the prompt templates for each model family are roughly as follows. Note that the prompt format is critical and must match the model's expected input. For Gemma models, you need to accept the license on HuggingFace before downloading; this is a restriction from the model provider, not vLLM.

Prompt templates (Python code snippets):

  • InternVL:
    messages = [[{'role': 'user', 'content': f"<image>\n{question}"}] for question in questions]
    prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  • Qwen:
    messages = [[{'role': 'user', 'content': [{'type': 'image'}, {'type': 'text', 'text': question}]}] for question in questions]
    prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  • Deepseek VL:
    prompts = [f"<|User|>: <image>\n{question}\n\n<|Assistant|>:" for question in questions]
    
  • Gemma:
    prompts = [("<bos><start_of_turn>user\n"
                f"<start_of_image>{question}<end_of_turn>\n"
                "<start_of_turn>model\n") for question in questions]
    
  • MiniCPM:
    prompts = [f"<image>{question}" for question in questions]
    
  • Llama (vision):
    prompts = [f"USER: <image>\n{question}\nASSISTANT:" for question in questions]
    
  • LLaVA:
    prompts = [f"USER: <image>\n{question}\nASSISTANT:" for question in questions]
    

For Gemma models, you must agree to the license on HuggingFace before downloading—this is not a vLLM limitation (source). Would you like more detail or code for a specific model?
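
Once the license is accepted on the model page, you also need to authenticate locally so the gated weights can be downloaded. A minimal sketch using huggingface_hub (the token string is a placeholder; running huggingface-cli login once, or exporting HF_TOKEN, works as well):

from huggingface_hub import login

# Log in with a read token from https://huggingface.co/settings/tokens
login(token="hf_...")  # placeholder token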

When processing the image of a white kitten in a field, the output is drastically cut off. How can this be?

/home/bdck/PROJECTS_WSL/VLM/venv_vlm/bin/python /home/bdck/PROJECTS_WSL/VLM/VLM_redoUsingVllm.py
INFO 05-27 15:28:28 [__init__.py:239] Automatically detected platform cuda.
INFO 05-27 15:28:35 [config.py:600] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 05-27 15:28:35 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 05-27 15:28:35 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 05-27 15:28:36 [utils.py:2273] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See Troubleshooting - vLLM for more information. Reason: CUDA is initialized
INFO 05-27 15:28:39 [__init__.py:239] Automatically detected platform cuda.
INFO 05-27 15:28:40 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='OpenGVLab/InternVL3-2B', speculative_config=None, tokenizer='OpenGVLab/InternVL3-2B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=OpenGVLab/InternVL3-2B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":,"compile_sizes":,"cudagraph_capture_sizes":,"max_capture_size":0}
2025-05-27 15:28:40,701 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 05-27 15:28:40 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f07415a1990>
INFO 05-27 15:28:41 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
WARNING 05-27 15:28:41 [interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 05-27 15:28:41 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 05-27 15:28:41 [gpu_model_runner.py:1258] Starting to load model OpenGVLab/InternVL3-2B…
INFO 05-27 15:28:41 [config.py:3334] cudagraph sizes specified by model runner is overridden by config
INFO 05-27 15:28:42 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 05-27 15:28:42 [weight_utils.py:265] Using model weights format ['*.safetensors']
INFO 05-27 15:28:42 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.70it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.70it/s]

INFO 05-27 15:28:43 [loader.py:447] Loading weights took 0.62 seconds
INFO 05-27 15:28:43 [gpu_model_runner.py:1273] Model loading took 3.9110 GiB and 1.253380 seconds
INFO 05-27 15:28:43 [gpu_model_runner.py:1542] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
INFO 05-27 15:28:45 [kv_cache_utils.py:578] GPU KV cache size: 18,528 tokens
INFO 05-27 15:28:45 [kv_cache_utils.py:581] Maximum concurrency for 512 tokens per request: 36.19x
INFO 05-27 15:28:45 [core.py:162] init engine (profile, create kv cache, warmup model) took 2.03 seconds

The content of this image is in English. It shows a picture of a small
[rank0]:[W527 15:28:46.848639031 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.7 documentation (function operator())

Your output is cut off because your max_seq_len (context length) is set to 512, which is very short for multimodal models—image tokens plus text tokens can easily exceed this, causing the model to truncate or stop mid-sentence. This is a common cause of incomplete outputs, especially for vision models where the image embedding consumes many tokens (source).

To fix this, increase max_seq_len (or max_model_len) to the largest value your GPU can handle (e.g., 1024, 2048, or higher if memory allows). If you still hit memory errors, reduce batch size (max_num_seqs=1), use quantized models, or lower image resolution via mm_processor_kwargs. See context length and batch size and conserving memory for more.
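
For reference, this is how those parameters are passed to the engine (values are illustrative; raise max_model_len only as far as the KV cache reported in the logs allows):

from vllm import LLM

llm = LLM(
    model="OpenGVLab/InternVL3-2B",
    max_model_len=4096,           # room for image tokens + question + a full answer
    max_num_seqs=1,               # one request at a time keeps the KV cache small
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)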

Would you like more detail on these settings?

With nothing else changed, I set max_model_len = 30000, which should be more than enough to get a nice input + output. However, the output I get is still cut off:

The image depicts a young kitten walking on a grassy field. The kitten is
[rank0]:[W527 15:36:42.900602291 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.7 documentation (function operator())

If you set max_model_len=30000 but still get cut-off output, it’s likely your GPU does not have enough memory to allocate a KV cache for such a long context, so vLLM silently reduces the effective context length or fails to allocate enough cache (see here). You can check the logs for lines like “The model’s max seq len (30000) is larger than the maximum number of tokens that can be stored in KV cache (…)” or similar warnings.

To resolve this, set max_model_len to a value that fits your GPU memory (e.g., 1024–4096 for 12GB VRAM), and check the actual allocated KV cache size in the logs. Also, reduce max_num_seqs to 1 and lower image resolution with mm_processor_kwargs if needed (conserving memory). Would you like a code example for these settings?

INFO 05-27 15:44:22 [loader.py:447] Loading weights took 0.40 seconds
INFO 05-27 15:44:22 [gpu_model_runner.py:1273] Model loading took 1.7627 GiB and 42.797461 seconds
INFO 05-27 15:44:22 [gpu_model_runner.py:1542] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
INFO 05-27 15:44:24 [kv_cache_utils.py:578] GPU KV cache size: 296,016 tokens
INFO 05-27 15:44:24 [kv_cache_utils.py:581] Maximum concurrency for 4,096 tokens per request: 72.27x
INFO 05-27 15:44:24 [core.py:162] init engine (profile, create kv cache, warmup model) took 1.79 seconds
The image showcases a cute, small kitten with a gray and white coat running on

Even though your GPU KV cache size is 296,016 tokens, your output is still cut off because the generation length is capped by your sampling parameters: if you do not pass SamplingParams, vLLM uses a small default max_tokens (16), so the output stops after a few words (see here). An incorrect prompt template or early stopping criteria can also truncate the output.

To get longer outputs, explicitly set a higher max_tokens in your sampling parameters (e.g., max_tokens=200). Also, ensure your prompt template is correct for the model. Would you like a code example for this?

With all the info from above, create the best coding example, using the correct params, etc.

Here is a best-practice example for multimodal inference with vLLM, using InternVL2-1B, with all recommended parameters for a 12GB GPU. This code sets a reasonable context length, batch size, and sampling parameters, and uses the correct prompt template for InternVL. Adjust max_model_len and image path as needed.

from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoTokenizer

# Set up model and tokenizer
model_name = "OpenGVLab/InternVL2-1B"
llm = LLM(
    model=model_name,
    gpu_memory_utilization=0.9,      # Use 90% of GPU memory
    enforce_eager=True,              # Lower memory usage (slower)
    max_model_len=2048,              # Fits 12GB VRAM, adjust if needed
    max_num_seqs=1,                  # Single request at a time
    limit_mm_per_prompt={"image": 1} # Only 1 image per prompt
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Prepare prompt using the correct template
question = "What is the content of this image in English?"
messages = [{"role": "user", "content": f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load image
image = Image.open("./image.png").convert("RGB")

# Set sampling parameters for longer, deterministic output
sampling_params = SamplingParams(
    temperature=0.0,     # Deterministic output
    max_tokens=200,      # Allow up to 200 output tokens
    top_p=1.0,
    top_k=1
)

# Run inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
}, sampling_params=sampling_params)

# Print result
for o in outputs:
    print(o.outputs[0].text)

This example ensures correct prompt formatting, memory usage, and output length. For other models, adjust the prompt template as described previously.
