Qwen 2.5 VL for videos

Could you share a simple video example of using Qwen 2.5 VL with vLLM? What is the best configuration for videos to speed up inference and reduce memory consumption?

I tried this example, which needs to be optimized:
```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen2.5-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=4096,
    stop_token_ids=[],
)

video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the video content in detail."},
        {
            "type": "video",
            "video": "/home/jupyter/APIs/Main_endpoints/local_videos/episode_1.mp4",
            "total_pixels": 20480 * 28 * 28, "min_pixels": 16 * 28 * 28,
            # "fps": 0.1,
            "nframes": 32,
        }
    ]},
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
print("Video FPS:", video_kwargs.get("fps", "Not provided"))

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('--------------------------')
```

Below is a simple, optimized video example for Qwen2.5-VL with vLLM. To speed up inference and reduce memory, set `limit_mm_per_prompt={"video": 1}` so at most one video item is allocated per prompt, and use `mm_processor_kwargs` to cap video resolution and frame count (e.g., `"max_pixels": 768 * 768`, `"nframes": 8`). Fewer, smaller frames produce fewer vision tokens, which directly cuts both compute and memory. Also set `gpu_memory_utilization` to a value like 0.8, and pass `enforce_eager=True` if you run into CUDA graph memory issues. Example:

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
VIDEO_PATH = "/home/jupyter/APIs/Main_endpoints/local_videos/episode_1.mp4"

llm = LLM(
    model=MODEL_PATH,
    gpu_memory_utilization=0.8,        # leave headroom for activations
    enforce_eager=True,                # skip CUDA graph capture to save memory
    limit_mm_per_prompt={"video": 1},  # budget memory for at most one video per prompt
    mm_processor_kwargs={"max_pixels": 768 * 768, "nframes": 8, "fps": 1},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=1024,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the video content in detail."},
        # qwen_vl_utils accepts either "fps" or "nframes" per video, not both.
        {"type": "video", "video": VIDEO_PATH, "total_pixels": 768 * 768, "nframes": 8},
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,  # carries the sampled FPS back to vLLM
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
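As a rough sanity check on why these caps help, you can estimate how many vision tokens a video produces. The sketch below assumes Qwen2.5-VL's usual patching scheme (one token per 28×28 pixel block after 2×2 spatial merging, with frames merged in pairs along time); treat the constants as assumptions, not guarantees:

```python
# Back-of-the-envelope estimate of video token count for Qwen2.5-VL.
# Assumes one token per 28x28 pixel block after 2x2 spatial merging,
# and pairs of frames merged along the temporal axis.
def estimate_video_tokens(nframes: int, height: int, width: int) -> int:
    tokens_per_frame = (height // 28) * (width // 28)
    temporal_groups = max(nframes // 2, 1)  # frames merged in pairs
    return temporal_groups * tokens_per_frame

# 32 frames at 1344x768 vs. 8 frames capped near 768x768:
print(estimate_video_tokens(32, 768, 1344))  # ~21k tokens -> heavy
print(estimate_video_tokens(8, 756, 756))    # ~2.9k tokens -> much lighter
```

An order-of-magnitude drop in vision tokens shortens both prefill time and the KV cache the request occupies.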

For more details and further memory-optimization tips, see the vLLM multimodal inputs documentation and its "conserving memory" guide.
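If you still hit out-of-memory errors, two further knobs on vLLM's `LLM` constructor are worth trying: `max_model_len`, which shrinks the KV cache by capping context length, and `max_num_seqs`, which limits how many sequences are batched concurrently. A minimal sketch, with values chosen purely for illustration:

```python
from vllm import LLM

# Illustrative values only; tune for your GPU and workload.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.8,
    enforce_eager=True,
    limit_mm_per_prompt={"video": 1},
    max_model_len=8192,   # smaller context -> smaller KV cache
    max_num_seqs=2,       # fewer concurrent sequences during batching
)
```

Make sure `max_model_len` still leaves room for the video tokens plus your `max_tokens` output budget, or vLLM will reject the request.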

Would you like a deeper explanation of each configuration or more advanced optimization strategies?
