I use Qwen3-VL-2B-Instruct for an image and video captioning workload. QPS per GPU (mixed image and video) on 1xA100 is 7.33. When I scale to 8xA100 (TP=1, DP=8), QPS per GPU drops to 0.93, so 8xA100 has almost the same total throughput as 1xA100.
I monitored CPU, GPU, and memory utilization, and all of them are heavily underused. KV cache utilization is very low (0.5%). My guess is that the bottleneck is the multimodal preprocessing, which may not be parallelized.
Does anyone have experience tuning vLLM on a multi-GPU host? Thanks!
Your observation is correct: when scaling Qwen3-VL-2B-Instruct to 8xA100 with TP=1, DP=8, throughput per GPU can drop sharply if the multimodal (MM) preprocessing (image/video decoding and feature extraction) becomes the bottleneck. This is a known issue: MM preprocessing is CPU-bound and not always parallelized efficiently, especially for video inputs, so GPU and KV cache utilization stay low no matter how many GPUs you add. Increasing DP alone adds engine replicas but does not, by default, spread MM preprocessing across all available CPU cores. vLLM's API server scale-out and MM processor caching may help, but they are not always sufficient for heavy video workloads or high-concurrency scenarios. See Performance: Multi-Modal Benchmark on NVIDIA A100 and Qwen 2.5 VL 7B throughput.
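To make the scaling gap concrete, here is a small sketch that turns the per-GPU numbers from the question into aggregate throughput, speedup, and scaling efficiency. The function name `scaling_summary` is only illustrative; the inputs are the figures reported above.

```python
def scaling_summary(qps_per_gpu_single, qps_per_gpu_multi, n_gpus):
    """Compare a single-GPU run against an n-GPU data-parallel run."""
    total_single = qps_per_gpu_single          # one GPU, so total == per-GPU
    total_multi = qps_per_gpu_multi * n_gpus   # aggregate QPS across all GPUs
    speedup = total_multi / total_single       # ideal value would be n_gpus
    efficiency = speedup / n_gpus              # ideal value would be 1.0
    return total_multi, speedup, efficiency


total, speedup, efficiency = scaling_summary(7.33, 0.93, 8)
print(f"aggregate QPS on 8 GPUs: {total:.2f}")
print(f"speedup over 1 GPU: {speedup:.2f}x, scaling efficiency: {efficiency:.1%}")
```

With the reported numbers, the 8-GPU run delivers about 7.44 QPS in aggregate, which is barely above the single-GPU 7.33 and corresponds to a scaling efficiency of roughly 13%. That pattern is what you would expect when a shared stage in front of the GPUs, rather than the GPUs themselves, limits throughput.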
To improve throughput, try: (1) increasing the number of API server processes with --api-server-count to parallelize input processing, (2) tuning VLLM_MEDIA_LOADING_THREAD_COUNT toward your available CPU core count, and (3) enabling shared-memory MM processor caching with --mm-processor-cache-type shm if your workload has repeated MM inputs. For video, consider hardware-accelerated decoding and zero-copy transfer (see Enabling Zero-Copy Video with PyNvVideoCodec and IPC). Even with these changes, MM preprocessing can remain the bottleneck for large-scale, mixed image/video workloads, so re-measure QPS and CPU utilization after each change. Would you like a step-by-step breakdown of these tuning strategies?
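As a rough sketch of how the three settings above could be combined, the snippet below assembles a launch command without running it. Several things here are assumptions rather than verified recommendations: the helper `build_launch` is hypothetical, the model ID `Qwen/Qwen3-VL-2B-Instruct` is assumed to be the Hugging Face name of the model in the question, splitting CPU cores evenly across API server processes is just one heuristic, and flag availability depends on your vLLM version (check `vllm serve --help`).

```python
import os
import shlex


def build_launch(model, dp_size, api_servers, cpu_cores):
    """Return (env, argv) for a vLLM server launch; nothing is executed here."""
    # Heuristic (assumption): divide CPU cores evenly among API server processes.
    threads = max(1, cpu_cores // api_servers)
    env = {"VLLM_MEDIA_LOADING_THREAD_COUNT": str(threads)}
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",
        "--data-parallel-size", str(dp_size),
        "--api-server-count", str(api_servers),   # (1) parallel input processing
        "--mm-processor-cache-type", "shm",       # (3) shared-memory MM cache
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch(
        "Qwen/Qwen3-VL-2B-Instruct", dp_size=8, api_servers=8,
        cpu_cores=os.cpu_count() or 1,
    )
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(argv))  # inspect the command before running it
```

Printing the command instead of executing it lets you review the flags against your installed version first. Treat the API server count and thread count as starting points to sweep while watching QPS and CPU utilization, not as tuned values.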