A QPS (queries per second) of around 2 is lower than typical for vLLM, but actual throughput depends on several factors: model size, hardware, batch size, and input modality. For multi-modal models (prompt + image), throughput is often lower than for pure text because of image processing overhead. Using only one GPU (--tensor-parallel-size 1) also limits parallelism, and a high --max-num-seqs does not guarantee high throughput if requests are not batched efficiently or if the model/hardware is the bottleneck.

For reference, vLLM benchmarks for Qwen2.5-VL-72B on 4xA100 GPUs show an output token throughput of ~437 tok/s and a request throughput of ~3.8 req/s for vision tasks, so your QPS is within the expected range for a single GPU and multi-modal input. It could be improved with more GPUs or an optimized batching strategy (Qwen2.5-VL Usage Guide).
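If you want to verify the number yourself, one quick way is to time an offline batch with vLLM's Python API. This is a minimal sketch under stated assumptions: the model name, prompt, and batch size are placeholders for your actual setup, and real multi-modal requests would also pass image data.

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical sketch for a rough offline QPS measurement.
# Replace the model with the one you actually serve (it must fit on your GPU);
# real requests would also attach images via multi_modal_data.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model
    tensor_parallel_size=1,               # single GPU, as in your setup
    max_num_seqs=64,                      # cap on sequences scheduled per batch
)
sampling = SamplingParams(max_tokens=128)

prompts = ["Describe the image in one sentence."] * 64
start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start
print(f"request throughput: {len(outputs) / elapsed:.2f} req/s")
```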
If you want to increase QPS, consider increasing the number of GPUs (--tensor-parallel-size), using data parallelism, or optimizing your request batching. Would you like a step-by-step breakdown of possible optimizations?
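As a concrete starting point, here is a sketch of the same workload sharded across four GPUs with tensor parallelism; the model name and parameter values are placeholder assumptions, not tuned recommendations.

```python
from vllm import LLM

# Hypothetical multi-GPU configuration: shard the 72B model across 4 GPUs.
llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    tensor_parallel_size=4,        # one model shard per GPU
    max_num_seqs=256,              # allow larger batches per scheduling step
    gpu_memory_utilization=0.90,   # leave headroom for multi-modal inputs
)
```

Roughly speaking, tensor parallelism helps when the model is too large or too slow on one GPU, while data parallelism (running several independent vLLM replicas behind a load balancer) helps once a single replica is already compute-bound.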