QPS doesn't scale with multi-card GPU

voh · April 29, 2026, 6:32am

I use Qwen3-VL-2B-Instruct for an image and video captioning workload. On 1xA100, QPS per GPU (mixed image and video) is 7.33. After scaling to 8xA100 (TP=1, DP=8), QPS per GPU drops to 0.93, so 8xA100 delivers almost the same total throughput as 1xA100.
I monitored CPU, GPU, and memory utilization, and all of them are heavily underused. KV cache utilization is very low (0.5%). My guess is that the bottleneck is the multimodal (MM) preprocessing, which may not be parallelized.
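
To make the numbers above concrete, here is a small sketch of the scaling arithmetic. It uses only the two measured QPS-per-GPU figures from this post; the function name `scaling_summary` is made up for illustration.

```python
def scaling_summary(qps_per_gpu_single, qps_per_gpu_multi, num_gpus):
    """Return (total QPS on num_gpus, speedup vs. 1 GPU, scaling efficiency)."""
    total_single = qps_per_gpu_single           # one GPU: total == per-GPU
    total_multi = qps_per_gpu_multi * num_gpus  # aggregate over all GPUs
    ideal_total = qps_per_gpu_single * num_gpus # perfect linear scaling
    speedup = total_multi / total_single
    efficiency = total_multi / ideal_total
    return total_multi, speedup, efficiency


# Figures reported in this post: 7.33 QPS/GPU on 1xA100, 0.93 QPS/GPU on 8xA100.
total, speedup, efficiency = scaling_summary(7.33, 0.93, 8)
print(f"total QPS on 8 GPUs: {total:.2f}")
print(f"speedup over 1 GPU:  {speedup:.3f}x")
print(f"scaling efficiency:  {efficiency:.1%}")
```

In other words, the aggregate rate stays close to the single-GPU rate, which is what you would expect if some shared stage in front of the GPUs caps the whole pipeline.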

Does anyone have experience tuning vLLM on a multi-GPU host? Thanks!

RunLLM · April 29, 2026, 6:32am

Your observation is correct: when scaling Qwen3-VL-2B-Instruct to 8xA100 with TP=1, DP=8, throughput per GPU can drop sharply if the multimodal (MM) preprocessing (image/video decoding and feature extraction) becomes the bottleneck. This is a known issue—MM preprocessing is CPU-bound and not always parallelized efficiently, especially for video inputs, leading to low GPU and KV cache utilization even with many GPUs. Increasing DP alone does not parallelize MM preprocessing across all available CPU resources by default, and vLLM’s API server scale-out or MM processor caching may help but are not always sufficient for heavy video workloads or high concurrency scenarios. See Performance: Multi-Modal Benchmark on NVIDIA A100 and Qwen 2.5 VL 7B throughput.

To improve throughput, try: (1) increasing the number of API server processes with --api-server-count to parallelize input processing, (2) tuning VLLM_MEDIA_LOADING_THREAD_COUNT to match your CPU core count, and (3) enabling shared-memory MM processor caching with --mm-processor-cache-type shm if your workload has repeated MM inputs. For video, consider hardware-accelerated decoding and zero-copy transfer (see Enabling Zero-Copy Video with PyNvVideoCodec and IPC). However, even with these, MM preprocessing can remain the bottleneck for large-scale, mixed image/video workloads. Would you like a step-by-step breakdown of these tuning strategies?
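
As a rough sketch of how the three settings above might be combined, the helper below assembles a `vllm serve` command line and its environment. The flag and variable names are the ones mentioned in this reply plus the TP/DP size flags; the model id `Qwen/Qwen3-VL-2B-Instruct`, the helper name `build_vllm_serve_command`, and all numeric values are assumptions for illustration, not recommended settings. Check `vllm serve --help` on your installed version before relying on any of them.

```python
import shlex


def build_vllm_serve_command(model, dp_size, api_server_count,
                             media_threads, use_shm_cache=False):
    """Build (env, argv) for a vLLM server launch. Hypothetical helper."""
    # Environment variable named in the reply above; value is illustrative.
    env = {"VLLM_MEDIA_LOADING_THREAD_COUNT": str(media_threads)}
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",
        "--data-parallel-size", str(dp_size),
        # More API server processes to parallelize input processing.
        "--api-server-count", str(api_server_count),
    ]
    if use_shm_cache:
        # Only useful when the same images/videos are sent repeatedly.
        argv += ["--mm-processor-cache-type", "shm"]
    return env, argv


# Assumed values: 8 DP replicas, 8 API servers, 16 media-loading threads.
env, argv = build_vllm_serve_command(
    "Qwen/Qwen3-VL-2B-Instruct", dp_size=8, api_server_count=8,
    media_threads=16, use_shm_cache=True,
)
env_prefix = " ".join(f"{k}={v}" for k, v in env.items())
print(env_prefix, shlex.join(argv))
```

Building the command in one place makes it easier to sweep one knob at a time (for example `api_server_count`) while measuring QPS, rather than changing several settings at once.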


yangshanjun · May 6, 2026, 3:13am

@RunLLM Is embed_tokens shared with the target model only when PP is not used?

In eagle.py, load_model only shares it under `if get_pp_group().world_size == 1:`? So in the non-PP case, MTP is not supported?

RunLLM · May 6, 2026, 7:04am

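Yes. In the load_model method of eagle.py, embed_tokens is shared with the target model only when get_pp_group().world_size == 1 (that is, pipeline parallelism is not enabled). With PP (world_size > 1), embed_tokens is not shared automatically, because the distribution of the embedding layer and its weight management are more complex under PP, and sharing directly could cause errors. See the related source code notes.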
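
The condition described above can be illustrated with a simplified stand-in. This is not vLLM code: the classes `Embedding` and `Model` and the function `maybe_share_embed_tokens` are hypothetical, and `pp_world_size` stands for the value of `get_pp_group().world_size`. The real load_model may apply further conditions.

```python
from dataclasses import dataclass


@dataclass
class Embedding:
    name: str


@dataclass
class Model:
    embed_tokens: Embedding


def maybe_share_embed_tokens(draft, target, pp_world_size):
    """Point the draft's embedding at the target's only when PP is off.

    Returns True if the embedding was shared, False otherwise.
    """
    if pp_world_size == 1:
        # No pipeline parallelism: the draft reuses the target's embedding object.
        draft.embed_tokens = target.embed_tokens
        return True
    # With PP the draft keeps its own embedding in this sketch.
    return False


target = Model(Embedding("target"))
draft_no_pp = Model(Embedding("draft"))
draft_pp = Model(Embedding("draft"))

print(maybe_share_embed_tokens(draft_no_pp, target, pp_world_size=1))
print(maybe_share_embed_tokens(draft_pp, target, pp_world_size=2))
```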

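In addition, MTP (Multi-Token Prediction) models do not themselves depend on whether embed_tokens is shared. Under PP, however, MTP draft models are currently usually not supported with pipeline parallelism, because many MTP draft models do not implement the SupportsPP interface and therefore fail vLLM's PP check. See the related discussion. Would you like a more detailed code analysis?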

