What is the recommended way to start the vLLM server engine for inference with InternVL3_5-8B? Getting ~2 QPS

I am using the vLLM library for inference. For some reason I am only getting a QPS of around 2. Is this expected? I am starting the server as follows:

vllm serve "model_path" --trust-remote-code --max-num-seqs 1000 --max-model-len 8192 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 1}' --tensor-parallel-size 1 --port 8080

Input: prompt + single image
Output: ~100 tokens
vLLM version I am using: 0.10.1.1

A QPS (queries per second) of around 2 can look low for vLLM, but actual throughput depends on several factors: model size, hardware, batch size, and input type. Multi-modal requests (prompt + image) generally achieve lower throughput than pure text because of image preprocessing and the extra vision-encoder compute. Running on a single GPU (--tensor-parallel-size 1) also limits parallelism, and a high --max-num-seqs only helps if enough requests arrive concurrently to be batched; if the client sends requests one at a time, the effective batch size stays at 1.

For reference, vLLM benchmarks for Qwen2.5-VL-72B on 4x A100 GPUs show an output token throughput of ~437 tok/s and a request throughput of ~3.8 req/s on vision tasks (Qwen2.5-VL Usage Guide). Your QPS is therefore within the expected range for a single GPU and multi-modal input, though it could be improved with more GPUs or better request batching.
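A quick back-of-envelope check ties token throughput to QPS. The decode throughput below is an assumed number for illustration, not a measurement for InternVL3_5-8B; substitute the value from your own server logs or a benchmark run:

```python
# Hypothetical sanity check: request throughput follows from token throughput.
output_tok_per_s = 200      # assumed aggregate decode throughput (tokens/s), not measured
tokens_per_request = 100    # from the question: ~100 output tokens per request
qps_estimate = output_tok_per_s / tokens_per_request
print(f"Estimated request throughput: {qps_estimate:.1f} req/s")  # -> 2.0 req/s
```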

If you want to increase QPS, consider adding GPUs (a higher --tensor-parallel-size), running multiple server replicas with data parallelism, or making sure the client sends enough concurrent requests for the server to batch, as in the sketch below. Would you like a step-by-step breakdown of possible optimizations?
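If the bottleneck is client-side (requests issued sequentially), a concurrent client lets vLLM batch them on the GPU. Below is a minimal sketch, assuming the server from the command above (OpenAI-compatible API on port 8080), the `openai` Python package, and placeholder values for the model name and image URL:

```python
# Minimal concurrent benchmarking client (sketch). Assumes the vLLM
# OpenAI-compatible server is running on port 8080; "model_path" and the
# image URL are placeholders to replace with your actual values.
import asyncio
import time

from openai import AsyncOpenAI

CONCURRENCY = 64          # in-flight requests; vLLM batches these server-side
NUM_REQUESTS = 256
IMAGE_URL = "https://example.com/test.jpg"   # placeholder image

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(CONCURRENCY)


async def one_request() -> None:
    # Send one prompt + single-image request, capped at ~100 output tokens.
    async with semaphore:
        await client.chat.completions.create(
            model="model_path",   # must match the model name the server reports
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                ],
            }],
            max_tokens=100,
        )


async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"{NUM_REQUESTS / elapsed:.2f} req/s with concurrency {CONCURRENCY}")


asyncio.run(main())
```

Raising CONCURRENCY until request throughput stops improving gives a rough upper bound on what a single GPU can deliver for this workload.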
