Calling vLLM's Python API to run inference with Qwen3-VL

sxcnmslll · December 5, 2025, 9:42am

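I want to instantiate a Qwen3-VL 8B model through vLLM's Python API and then use that model to run inference on my multimodal dataset. Please help me write the complete code.

My multimodal data looks like this: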

{"messages": [{"role": "user", "content": "这张图片有什么"}, {"role": "assistant", "content": "小狗"}], "images": ["/xxx/x.jpg"]}
{"messages": [{"role": "user", "content": "这张图片有什么"}, {"role": "assistant", "content": "小猫"}], "images": ["/YYY/Z.jpg"]}

RunLLM · December 5, 2025, 9:43am

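The official vLLM documentation and code currently have no complete multimodal inference example for the Qwen3-VL-8B model, but the inference flow for Qwen2-VL and Qwen2.5-VL can serve as a reference. You need to: 1) instantiate an LLM object, 2) process the messages and images with AutoProcessor, 3) build the prompt and multi_modal_data, and 4) call llm.generate to run inference. The exact support status and API parameters for Qwen3-VL-8B are not stated explicitly in the data sources. The code below uses Qwen2-VL as an example; if the Qwen3-VL-8B interface is the same, it can be used as-is, otherwise consult the official Qwen3-VL-8B documentation.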

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
from PIL import Image

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"  # for Qwen3-VL-8B, replace with the actual model path
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": "/xxx/x.jpg"},
        {"type": "text", "text": "这张图片有什么"}
    ]}
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
llm_inputs = {"prompt": prompt, "multi_modal_data": {"image": image_inputs}}
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

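To process several records in one batch, build one llm_inputs dict per record and pass the whole list to llm.generate. If Qwen3-VL-8B has special parameters or APIs, refer to its official documentation.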
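The batching step can be sketched as follows. This is a minimal example, assuming each dataset record has the shape shown in the question (a `messages` list plus an `images` list of paths). The helper names `load_records` and `record_to_messages` are hypothetical, and attaching all images to the first user turn is an assumption about how the dataset was built. The reference `assistant` turns are dropped so that the model produces its own answer.

```python
import json


def load_records(jsonl_text):
    """Parse JSONL text (one JSON object per non-empty line) into dicts."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def record_to_messages(record):
    """Convert one dataset record into Qwen-style chat messages.

    Assumption: every path in record["images"] belongs to the first user turn.
    """
    messages = []
    images_attached = False
    for turn in record["messages"]:
        if turn["role"] == "assistant":
            continue  # drop the reference answer; the model should generate it
        if turn["role"] == "user" and not images_attached:
            content = [{"type": "image", "image": path}
                       for path in record.get("images", [])]
            content.append({"type": "text", "text": turn["content"]})
            images_attached = True
        else:
            content = turn["content"]
        messages.append({"role": turn["role"], "content": content})
    return messages


sample = (
    '{"messages": [{"role": "user", "content": "这张图片有什么"}, '
    '{"role": "assistant", "content": "小狗"}], "images": ["/xxx/x.jpg"]}\n'
)
batch = [record_to_messages(r) for r in load_records(sample)]
print(batch[0])
```

Each entry of `batch` can then go through `processor.apply_chat_template` and `process_vision_info` exactly as in the single-record code, with the resulting dicts collected into one list for a single `llm.generate` call.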

Would you like a more detailed step-by-step walkthrough?

Hint: Mention @RunLLM in the post for followups.

sxcnmslll · December 5, 2025, 9:47am

@RunLLM
My dataset has changed:

{"messages": [{"role": "user", "content": "两张图片有什么区别"}, {"role": "assistant", "content": "前一张是小猫，后一张是小狗"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}

RunLLM · December 5, 2025, 9:47am

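You can pass multiple images as a list in the "image" field of multi_modal_data, and use a suitable template so that the prompt contains one image placeholder per image. For example: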

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"  # for Qwen3-VL-8B, replace with the actual model path
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 2})

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": "/xxx/x.jpg"},
        {"type": "image", "image": "/xxx/x.png"},
        {"type": "text", "text": "两张图片有什么区别"}
    ]}
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
llm_inputs = {"prompt": prompt, "multi_modal_data": {"image": image_inputs}}
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

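To support multiple images, set limit_mm_per_prompt={"image": 2} (or higher) and make sure the number of image placeholders in the prompt matches the number of images passed in. If Qwen3-VL-8B has special requirements, consult its official documentation.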
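Rather than hard-coding the limit, it can be derived from the dataset. The sketch below assumes records shaped like the ones in the question; `max_images_per_prompt` is a hypothetical helper, not part of vLLM.

```python
def max_images_per_prompt(records):
    """Return the largest number of images attached to any single record."""
    return max((len(r.get("images", [])) for r in records), default=0)


records = [
    {"messages": [{"role": "user", "content": "这张图片有什么"}],
     "images": ["/xxx/x.jpg"]},
    {"messages": [{"role": "user", "content": "两张图片有什么区别"}],
     "images": ["/xxx/x.jpg", "/xxx/x.png"]},
]

# This dict can be passed as LLM(..., limit_mm_per_prompt=limit_mm_per_prompt).
limit_mm_per_prompt = {"image": max_images_per_prompt(records)}
print(limit_mm_per_prompt)
```

Computing the limit once over the whole dataset avoids a request being rejected partway through a long run because one record carries more images than expected.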

Would you like a more detailed explanation?

sxcnmslll · December 5, 2025, 9:50am

@RunLLM I got an error when running it:
(vllm) python kk.py
INFO 12-05 17:49:09 [__init__.py:216] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/mnt/data/cpfs/lishan/model_train/kk.py", line 3, in <module>
    from qwen_vl_utils import process_vision_info
ModuleNotFoundError: No module named 'qwen_vl_utils'

RunLLM · December 5, 2025, 9:51am

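The error "ModuleNotFoundError: No module named 'qwen_vl_utils'" means the qwen-vl-utils package is not installed in your environment. Install it first:

pip install qwen-vl-utils

After that, process_vision_info can be imported normally. Let me know if you need more detailed troubleshooting steps.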

Sources:

qwen-vl-utils installation note

sxcnmslll · December 5, 2025, 11:07am

@RunLLM How do I configure IMAGE_MAX_TOKEN_NUM? I would like it to adapt dynamically to the actual number of image tokens.

RunLLM · December 5, 2025, 11:08am

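vLLM does not provide a direct configuration interface for an IMAGE_MAX_TOKEN_NUM parameter. The number of tokens allocated to images is controlled indirectly, mainly through limit_mm_per_prompt and mm_processor_kwargs (for example max_pixels, width, height). By setting max_pixels in mm_processor_kwargs you can adjust the upper bound on image tokens so that it better matches your actual image resolutions. For example: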

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"image": 8},
    mm_processor_kwargs={"max_pixels": 512 * 512},
)

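With this setting vLLM budgets memory and tokens according to your max_pixels instead of the model's maximum default. Note: max_pixels is forwarded to the model's HuggingFace processor, so besides the memory-profiling and token budget it also caps the resolution that images are resized to at inference time.
To adapt further to different images, choose max_pixels from the distribution of your real images before creating the LLM; the value passed to the constructor is an engine-wide default and does not adapt to each image on its own.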
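To pick max_pixels from real image sizes, it helps to estimate how many tokens an image of a given size produces. The sketch below re-implements, for illustration only, the resizing rule used by qwen_vl_utils for Qwen2-VL-style models as I understand it. The factor of 28 (14-pixel patches with a 2x2 merge) and the default pixel bounds are assumptions that hold for Qwen2-VL and Qwen2.5-VL; Qwen3-VL may use a different patch size, so check the model's preprocessor config before relying on the numbers. The aspect-ratio check of the original is omitted, and the vision start/end marker tokens are not counted.

```python
import math


def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Return (height, width) as multiples of `factor` within the pixel budget."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def estimate_image_tokens(height, width, factor=28, **kwargs):
    """Estimate the placeholder tokens for one image (one token per factor x factor cell)."""
    h_bar, w_bar = smart_resize(height, width, factor=factor, **kwargs)
    return (h_bar // factor) * (w_bar // factor)


# Estimate for a 1024x1024 image under the max_pixels used in the reply above.
print(estimate_image_tokens(1024, 1024, max_pixels=512 * 512))
```

Running the estimate over the dataset's image sizes gives a token distribution from which a max_pixels value and a suitable max_model_len can be chosen.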

Would you like a more detailed explanation of the mechanism, or a code example?

sxcnmslll · December 5, 2025, 11:11am

@RunLLM I don't see any image-resolution settings in the vLLM log:
INFO 12-05 19:05:08 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/msswift_tmp/Qwen3-VL-8B-Instruct', speculative_config=None, tokenizer='/msswift_tmp/Qwen3-VL-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=safetensors, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/msswift_tmp/Qwen3-VL-8B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}

RunLLM · December 5, 2025, 11:12am

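vLLM's logs do not directly show image resolution or related settings. Image resolution handling and the token budget have to be specified manually through parameters such as mm_processor_kwargs (for example max_pixels, width, height) when instantiating the LLM; otherwise vLLM uses the model's defaults. The startup log shows only global settings such as the model and inference parameters, not image-resolution details.

To adapt to image resolution, set mm_processor_kwargs when initializing the LLM; that information will not appear in the log automatically. To confirm that the resolution setting took effect, check the arguments in your code or print the relevant variables before inference.

Would you like more detailed configuration instructions?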

Sources:

Related issue discussion

sxcnmslll · December 5, 2025, 11:13am

@RunLLM Tell me the complete way to check whether the resolution setting has taken effect.

RunLLM · December 5, 2025, 11:14am

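In vLLM, the most direct way to check whether an image-resolution setting (such as max_pixels, width, height) has taken effect is to print the mm_processor_kwargs parameter and the actual image size around inference, or to inspect the image dimensions after process_vision_info has run. In addition, if a setting is invalid, the HuggingFace processor will usually emit a warning or raise an error.

The official logs do not show resolution parameters, so print the relevant variables in your code, or confirm the result through the image size returned by the preprocessing step. For example: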

print(llm.llm_engine.model_config.mm_processor_kwargs)
print(image_inputs[0].size)  # PIL images expose .size as (width, height), not .shape

Would you like a detailed code example or step-by-step instructions?

sxcnmslll · December 5, 2025, 11:16am

@RunLLM I want to know how many tokens the images take up in each inference call.

RunLLM · December 5, 2025, 11:16am

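vLLM does not log how many tokens the images take up in each inference call, but you can find out from the length of outputs[0].prompt_token_ids, or by counting how many times the image token id appears in it. For example:

print(outputs[0].prompt_token_ids.count(image_token_id))  # image_token_id: id of the model's image placeholder token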

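The image token id can be obtained from the processor or the model config. This gives an exact count of the tokens the images actually occupy in each request. Related discussion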
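The counting step can be sketched as below. The helper `count_image_tokens` is hypothetical, and the token ids in the sample list are made up for illustration. The value 151655 is assumed to be the id of `<|image_pad|>` in the Qwen2-VL tokenizer; in real use, look it up rather than hard-coding it, for example with `processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")`, and confirm the placeholder token name for Qwen3-VL.

```python
def count_image_tokens(prompt_token_ids, image_token_id):
    """Count how many prompt positions hold the image placeholder token."""
    return sum(1 for token_id in prompt_token_ids if token_id == image_token_id)


# Assumed id of the image placeholder token; verify it against your tokenizer.
IMAGE_TOKEN_ID = 151655

# Stand-in for outputs[0].prompt_token_ids: four image placeholders
# surrounded by ordinary (made-up) text token ids.
fake_prompt_token_ids = [
    151644, 151652, 151655, 151655, 151655, 151655, 151653, 100, 200,
]

image_tokens = count_image_tokens(fake_prompt_token_ids, IMAGE_TOKEN_ID)
text_tokens = len(fake_prompt_token_ids) - image_tokens
print(f"image tokens: {image_tokens}, other prompt tokens: {text_tokens}")
```

In a batch run, applying the same function to every element of `outputs` gives a per-request breakdown, which also shows whether a max_pixels setting is lowering the image token count as intended.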

Would you like a more detailed code example?

Sources:

Related issue with code and explanation
