I’m trying to set up a Serverless Endpoint on RunPod with vLLM (serving Qwen2.5-VL-3B-Instruct).
My goal is to generate a lot of image descriptions (batch inference).
Here is how I set it up:
Docker Image:
runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0
GPU:
48GB pro
(L40, L40S, or RTX 6000 ADA)
ENV vars:
MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
DOWNLOAD_DIR=/runpod-volume
DTYPE=float16
GPU_MEMORY_UTILIZATION=0.90
ENABLE_PREFIX_CACHING=1
QUANTIZATION=bitsandbytes
LIMIT_MM_PER_PROMPT=image=10,video=0
This call with one image works:
curl https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://www.site.com/images.jpeg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'
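For context on the "batch" part: what I have in mind so far is just firing one of these single-image requests per URL in parallel, roughly like the sketch below (urls.txt with one image URL per line is a hypothetical file of mine). I don't know whether that is the right approach on Serverless, hence the questions.

# Baseline sketch (my assumption, not tested at scale): one request per image URL,
# each launched in the background. No concurrency cap here; for a large list I would throttle.
while read -r url; do
  curl -s https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
    -d '{
      "model": "Qwen/Qwen2.5-VL-3B-Instruct",
      "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "'"$url"'"}}
      ]}],
      "max_tokens": 1000,
      "temperature": 0.7
    }' &
done < urls.txt
wait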
Now I have several questions.
Is it worth passing multiple images to the model in a single call? Would that be more efficient than sending one request per image like above?
If so, how should I structure the request? My best guess follows.
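Based on the OpenAI-style vision format, my guess is that I would simply add more image_url entries to the same content array, something like this (the second URL is made up), but I haven't confirmed that it works or that it is faster:

curl https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe each of these images separately."},
          {"type": "image_url", "image_url": {"url": "https://www.site.com/images.jpeg"}},
          {"type": "image_url", "image_url": {"url": "https://www.site.com/images2.jpeg"}}
        ]
      }
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'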
Did I miss anything in the ENV vars that would matter for throughput?
Thank you very much for any help or tips you can give me.