What is the recommended way to start the vLLM server engine for inference with InternVL3_5-8B? Getting ~2 QPS

I am using the vLLM library for inference. For some reason I am only getting a QPS of around 2. Is this expected? I am starting the server as follows:

vllm serve "model_path" --trust-remote-code --max-num-seqs 1000 --max-model-len 8192 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 1}' --tensor-parallel-size 1 --port 8080

Input: prompt + single image
Output: ~100 tokens
vLLM version I am using: 0.10.1.1

A QPS (queries per second) of around 2 can look low for vLLM, but actual throughput depends on several factors: model size, hardware, batch size, and input type. Multi-modal requests (prompt + image) generally achieve lower throughput than pure text because of image preprocessing and the extra vision-encoder compute. Running on a single GPU (--tensor-parallel-size 1) also limits parallelism, and a high --max-num-seqs only helps if enough requests arrive concurrently to be batched; if the client sends requests one at a time, the effective batch size stays at 1.

For reference, vLLM benchmarks for Qwen2.5-VL-72B on 4x A100 GPUs show an output token throughput of ~437 tok/s and a request throughput of ~3.8 req/s on vision tasks (Qwen2.5-VL Usage Guide). Your QPS is therefore within the expected range for a single GPU and multi-modal input, though it could be improved with more GPUs or better request batching.
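A quick back-of-envelope check ties token throughput to QPS. The decode throughput below is an assumed number for illustration, not a measurement for InternVL3_5-8B; substitute the value from your own server logs or a benchmark run:

```python
# Hypothetical sanity check: request throughput follows from token throughput.
output_tok_per_s = 200      # assumed aggregate decode throughput (tokens/s), not measured
tokens_per_request = 100    # from the question: ~100 output tokens per request
qps_estimate = output_tok_per_s / tokens_per_request
print(f"Estimated request throughput: {qps_estimate:.1f} req/s")  # -> 2.0 req/s
```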

If you want to increase QPS, consider adding GPUs (a higher --tensor-parallel-size), running multiple server replicas with data parallelism, or making sure the client sends enough concurrent requests for the server to batch, as in the sketch below. Would you like a step-by-step breakdown of possible optimizations?
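If the bottleneck is client-side (requests issued sequentially), a concurrent client lets vLLM batch them on the GPU. Below is a minimal sketch, assuming the server from the command above (OpenAI-compatible API on port 8080), the `openai` Python package, and placeholder values for the model name and image URL:

```python
# Minimal concurrent benchmarking client (sketch). Assumes the vLLM
# OpenAI-compatible server is running on port 8080; "model_path" and the
# image URL are placeholders to replace with your actual values.
import asyncio
import time

from openai import AsyncOpenAI

CONCURRENCY = 64          # in-flight requests; vLLM batches these server-side
NUM_REQUESTS = 256
IMAGE_URL = "https://example.com/test.jpg"   # placeholder image

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(CONCURRENCY)


async def one_request() -> None:
    # Send one prompt + single-image request, capped at ~100 output tokens.
    async with semaphore:
        await client.chat.completions.create(
            model="model_path",   # must match the model name the server reports
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                ],
            }],
            max_tokens=100,
        )


async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"{NUM_REQUESTS / elapsed:.2f} req/s with concurrency {CONCURRENCY}")


asyncio.run(main())
```

Raising CONCURRENCY until request throughput stops improving gives a rough upper bound on what a single GPU can deliver for this workload.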
