QPS doesn't scale across multiple GPUs

I use Qwen3-VL-2B-Instruct for image and video captioning. On 1xA100, the QPS per GPU (mixed image and video) is 7.33. When I scale to 8xA100 (TP=1, DP=8), the QPS per GPU drops to 0.93, so 8xA100 delivers almost the same total throughput as 1xA100.
I monitored CPU, GPU, and memory utilization, and all of them are heavily underused. KV cache utilization is also very low (0.5%). My guess is that the bottleneck is the multimodal (MM) preprocessing, which may not be parallelized.

Does anyone have experience tuning vLLM on a multi-GPU host? Thanks!

Your observation is consistent with a preprocessing bottleneck: when scaling Qwen3-VL-2B-Instruct to 8xA100 with TP=1, DP=8, throughput per GPU can drop sharply if multimodal (MM) preprocessing (image/video decoding and feature extraction) is the limiting stage. This is a known issue. MM preprocessing is CPU-bound and is not always parallelized efficiently, especially for video inputs, so GPU and KV cache utilization can stay low no matter how many GPUs are added. Increasing DP alone does not, by default, spread MM preprocessing across all available CPU cores. vLLM's API server scale-out and MM processor caching may help, but they are not always sufficient for heavy video workloads or high concurrency. See Performance: Multi-Modal Benchmark on NVIDIA A100 and Qwen 2.5 VL 7B throughput.
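The numbers in the question can be checked with quick arithmetic. The sketch below only uses the two per-GPU QPS figures reported above; the variable names are illustrative.

```python
# Per-GPU QPS figures as reported in the question.
qps_per_gpu_1x = 7.33   # 1xA100
qps_per_gpu_8x = 0.93   # 8xA100, TP=1, DP=8
num_gpus = 8

# Total throughput of the 8-GPU deployment.
total_qps_8x = qps_per_gpu_8x * num_gpus

# Speedup over a single GPU, and efficiency relative to ideal linear scaling.
speedup = total_qps_8x / qps_per_gpu_1x
scaling_efficiency = qps_per_gpu_8x / qps_per_gpu_1x

print(f"total QPS on 8 GPUs: {total_qps_8x:.2f}")
print(f"speedup vs 1 GPU:    {speedup:.2f}x")
print(f"scaling efficiency:  {scaling_efficiency:.1%}")
```

A speedup close to 1x with eight times the GPUs is what one would expect if a stage outside the GPUs, such as CPU-side preprocessing, caps the request rate.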

To improve throughput, try: (1) increasing the number of API server processes with --api-server-count to parallelize input processing, (2) tuning VLLM_MEDIA_LOADING_THREAD_COUNT to match your CPU core count, and (3) enabling shared-memory MM processor caching with --mm-processor-cache-type shm if your workload contains repeated MM inputs. For video, consider hardware-accelerated decoding and zero-copy transfer (see Enabling Zero-Copy Video with PyNvVideoCodec and IPC). Even with these changes, MM preprocessing can remain the bottleneck for large-scale, mixed image/video workloads. Would you like a step-by-step breakdown of these tuning strategies?
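As a rough illustration of options (1) to (3), the following sketch assembles a launch command and environment without starting anything. The flag and variable names are the ones mentioned above plus `--data-parallel-size`; the model ID, the counts, and the helper function names are assumptions to adjust for your host, not recommended values.

```python
import os
import shlex


def build_serve_command(model, dp_size, api_servers, cache_type="shm"):
    """Assemble a `vllm serve` argument list (hypothetical helper, values are placeholders)."""
    return [
        "vllm", "serve", model,
        "--data-parallel-size", str(dp_size),      # DP=8 as in the question
        "--api-server-count", str(api_servers),    # (1) more input-processing processes
        "--mm-processor-cache-type", cache_type,   # (3) shared-memory MM processor cache
    ]


def build_env(media_threads):
    """Copy the current environment and set the media loading thread count (2)."""
    env = dict(os.environ)
    env["VLLM_MEDIA_LOADING_THREAD_COUNT"] = str(media_threads)
    return env


# Assumed model ID and counts, for illustration only.
cmd = build_serve_command("Qwen/Qwen3-VL-2B-Instruct", dp_size=8, api_servers=8)
env = build_env(media_threads=os.cpu_count() or 8)

# Print the command for inspection; to launch, pass cmd and env to subprocess.run.
print(shlex.join(cmd))
```

Changing one setting at a time and re-measuring QPS makes it easier to see which of the three actually moves the bottleneck.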

Hint: Mention @RunLLM in the post for followups.

@RunLLM Is embed_tokens shared with the target model only in the non-PP case?

In load_model in eagle.py, is it shared only under `if get_pp_group().world_size == 1:`? Is MTP not supported in the non-PP case?

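Yes. In the load_model method of eagle.py, embed_tokens is shared with the target model only when get_pp_group().world_size == 1, that is, when pipeline parallelism (PP) is not enabled. Under PP (world_size > 1), embed_tokens is not shared automatically, because the placement and weight management of the embedding layer are more complex under PP, and sharing it directly could cause errors. See the related source code notes.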
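The condition described above can be pictured with a toy sketch. This is not vLLM's code: `ToyModel` and `maybe_share_embeddings` are hypothetical names, and the PP world size is passed in as a plain integer instead of being read from get_pp_group().

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # Stand-in for a model that owns an embedding table (hypothetical, not a vLLM class).
    embed_tokens: list = field(default_factory=lambda: [0.0, 0.0])


def maybe_share_embeddings(draft, target, pp_world_size):
    """Point the draft's embedding at the target's only when PP is disabled."""
    if pp_world_size == 1:
        # Same object, so no second copy of the weights is kept.
        draft.embed_tokens = target.embed_tokens
        return True
    # With PP enabled, the draft keeps its own embedding in this toy model.
    return False


target = ToyModel([1.0, 2.0])
draft_no_pp = ToyModel()
draft_pp = ToyModel()

shared_no_pp = maybe_share_embeddings(draft_no_pp, target, pp_world_size=1)
shared_pp = maybe_share_embeddings(draft_pp, target, pp_world_size=2)
print(shared_no_pp, shared_pp)
```

In the toy version, sharing means both models reference one object; the real method additionally has to deal with how weights are distributed across ranks.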

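In addition, MTP (Multi-Token Prediction) models do not themselves depend on whether embed_tokens is shared. Under PP, however, MTP draft models usually do not support pipeline parallelism at present, because many of them do not implement the SupportsPP interface and therefore fail vLLM's PP check. See the related discussion. Would you like a more detailed code analysis?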
