Is there any performance comparison between `at::sum_out` and `vllm::moe::moe_sum_kernel`?

When running the DeepSeek model in the ROCm ecosystem with `--tensor-parallel-size=16`, I observed that for `num_tokens=4K`, `topk=8`, and `hidden_size=7K`, `vllm::moe::moe_sum_kernel` is about three times faster than `at::sum_out`.

I’m curious why `vllm::moe::moe_sum_kernel` is only used when `topk` is 2, 3, or 4. It seems the kernel could also be beneficial for larger `topk` values. From what I can see, this logic is implemented in the `moe_sum` function in `vllm/csrc/moe/moe_align_sum_kernels.cu`.

Here is the command I used to launch the model:

```bash
vllm serve /DeepSeek-R1-INT8 \
  --trust-remote-code \
  --distributed-executor-backend ray \
  --dtype bfloat16 \
  --max-model-len 32748 \
  -tp 16 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 128 \
  --speculative_config '{"num_speculative_tokens": 1}' \
  --block-size 64 \
  --max-seq-len-to-capture 32748 \
  --port 8000
```

Here is the link to the model: [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8) on Hugging Face.

`vllm::moe::moe_sum_kernel` is only used for `topk` values of 2, 3, or 4 because the kernel is explicitly specialized and launched for those cases in the `moe_sum` function. For any other `topk`, the code falls back to `at::sum_out`, which is a general-purpose reduction. The dispatch is a `switch` statement over `topk` in `moe_sum` (in `csrc/moe/moe_align_sum_kernels.cu`), and the current codebase contains no kernel instantiation for larger `topk` values, so the custom kernel's performance benefit is not leveraged for `topk > 4`.
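To make the dispatch concrete, here is a paraphrased sketch of that pattern. It is not a verbatim copy of `moe_sum`; the `VLLM_DISPATCH_FLOATING_TYPES` macro and the exact launch configuration are assumed from vLLM's dispatch utilities:

```cpp
#include <torch/all.h>
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream
#include <algorithm>

// Paraphrased sketch of the dispatch in moe_sum
// (csrc/moe/moe_align_sum_kernels.cu).
// input:  [num_tokens, topk, hidden_size]
// output: [num_tokens, hidden_size]
void moe_sum(torch::Tensor& input, torch::Tensor& output) {
  const int hidden_size = input.size(-1);
  const int64_t num_tokens = output.numel() / hidden_size;
  const int topk = input.size(1);

  dim3 grid(num_tokens);                    // one thread block per token
  dim3 block(std::min(hidden_size, 1024));  // threads stride over hidden dim
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  switch (topk) {
    case 2:
      // VLLM_DISPATCH_FLOATING_TYPES: assumed from vLLM's dispatch_utils.h
      VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "moe_sum_kernel", [&] {
        vllm::moe::moe_sum_kernel<scalar_t, 2><<<grid, block, 0, stream>>>(
            output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),
            hidden_size);
      });
      break;
    // case 3 and case 4 are analogous, instantiating the kernel with
    // TOPK = 3 and TOPK = 4.
    default:
      // Every other topk (including your topk = 8) takes this path:
      at::sum_out(output, input, 1);  // generic reduction over the topk axis
      break;
  }
}
```

Because `TOPK` is a compile-time template parameter, the per-token reduction loop can be fully unrolled, which is plausibly where most of the speedup over the generic reduction comes from.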

If you want to use the faster kernel for larger `topk`, you would need to extend the kernel's template instantiations and the dispatch logic to cover those values; as of now, that is neither implemented nor documented. A rough sketch of what the extension might look like is below.
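This sketch assumes the kernel is templated on `TOPK`, as the 2/3/4 specializations suggest. Everything below is hypothetical and untested: the kernel body, the fp32 accumulator, and the `case 8` arm are my assumptions, and any benefit at `topk = 8` would need to be confirmed with a benchmark like the one behind your 3x number:

```cpp
#include <cstdint>

// Hypothetical TOPK-templated reduction kernel (untested sketch).
// out:   [num_tokens, d]        -- per-token sums
// input: [num_tokens, TOPK, d]  -- per-expert partial outputs
template <typename scalar_t, int TOPK>
__global__ void moe_sum_kernel(scalar_t* __restrict__ out,
                               const scalar_t* __restrict__ input,
                               const int d) {
  const int64_t token = blockIdx.x;  // one block per token
  for (int64_t i = threadIdx.x; i < d; i += blockDim.x) {
    float acc = 0.0f;  // assumption: accumulate in fp32 for bf16/fp16 inputs
#pragma unroll
    for (int k = 0; k < TOPK; ++k) {  // fully unrolled at compile time
      acc += static_cast<float>(input[token * TOPK * d + k * d + i]);
    }
    out[token * d + i] = static_cast<scalar_t>(acc);
  }
}

// Hypothetical extra arm for the switch in moe_sum, so that topk = 8
// also reaches the specialized kernel instead of at::sum_out:
//   case 8:
//     VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "moe_sum_kernel", [&] {
//       vllm::moe::moe_sum_kernel<scalar_t, 8><<<grid, block, 0, stream>>>(
//           output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),
//           hidden_size);
//     });
//     break;
```

On ROCm the same source builds through HIP, so the extension itself is mechanical; the open question is whether the unrolled kernel still beats `at::sum_out` at larger `TOPK` on your hardware.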
