I’m trying to set up a Serverless Endpoint on RunPod with vLLM (serving Qwen2.5-VL-3B-Instruct).
My goal is to generate a lot of image descriptions (batch inference).
Here is how I set it up:
Docker Image:
runpod/worker-v1-vllm:v2.7.0stable-cuda12.1.0
GPU:
48GB pro
(L40, L40S, or RTX 6000 ADA)
ENV vars:
MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
DOWNLOAD_DIR=/runpod-volume
DTYPE=float16
GPU_MEMORY_UTILIZATION=0.90
ENABLE_PREFIX_CACHING=1
QUANTIZATION=bitsandbytes
LIMIT_MM_PER_PROMPT=image=10,video=0
This call with one image works:
curl https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://www.site.com/images.jpeg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'
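For context on the "batch" part: what I have in mind so far is just firing one of these single-image requests per URL in parallel, roughly like the sketch below (urls.txt with one image URL per line is a hypothetical file of mine). I don't know whether that is the right approach on Serverless, hence the questions.

# Baseline sketch (my assumption, not tested at scale): one request per image URL,
# each launched in the background. No concurrency cap here; for a large list I would throttle.
while read -r url; do
  curl -s https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
    -d '{
      "model": "Qwen/Qwen2.5-VL-3B-Instruct",
      "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "'"$url"'"}}
      ]}],
      "max_tokens": 1000,
      "temperature": 0.7
    }' &
done < urls.txt
wait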
Now I have several questions.
Is it worth passing multiple images to the model in a single call? Would that be more efficient than sending one request per image like above?
If so, how should I structure the request? My best guess follows.
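Based on the OpenAI-style vision format, my guess is that I would simply add more image_url entries to the same content array, something like this (the second URL is made up), but I haven't confirmed that it works or that it is faster:

curl https://api.runpod.ai/v2/xxxxxxxx/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer rpa_xxxxxxxxxxxxxxxx" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe each of these images separately."},
          {"type": "image_url", "image_url": {"url": "https://www.site.com/images.jpeg"}},
          {"type": "image_url", "image_url": {"url": "https://www.site.com/images2.jpeg"}}
        ]
      }
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'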
Did I miss anything in the ENV vars that would matter for throughput?
Thank you very much for any help or tips you can give me.