Is there any performance comparison between `at::sum_out` and `vllm::moe::moe_sum_kernel`?

When running the DeepSeek model in the ROCm ecosystem with `--tensor-parallel-size=16`, I observed that for `num_tokens=4K`, `topk=8`, and `hidden_size=7K`, `vllm::moe::moe_sum_kernel` is about three times faster than `at::sum_out`.

I’m curious why `vllm::moe::moe_sum_kernel` is only used when `topk` is 2, 3, or 4. It seems the kernel could also be beneficial for larger `topk` values. From what I can see, this logic is implemented in the `moe_sum` function in `vllm/csrc/moe/moe_align_sum_kernels.cu`.

Here is the command I used to launch the model:

```bash
vllm serve /DeepSeek-R1-INT8 \
  --trust-remote-code \
  --distributed-executor-backend ray \
  --dtype bfloat16 \
  --max-model-len 32748 \
  -tp 16 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 128 \
  --speculative_config '{"num_speculative_tokens": 1}' \
  --block-size 64 \
  --max-seq-len-to-capture 32748 \
  --port 8000
```

Here is the link to the model: [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8) on Hugging Face.

`vllm::moe::moe_sum_kernel` is only used for `topk` values of 2, 3, or 4 because the kernel is explicitly specialized and launched for those cases in the `moe_sum` function. For any other `topk`, the code falls back to `at::sum_out`, which is a general-purpose reduction. The dispatch is a `switch` statement over `topk` in `moe_sum` (in `csrc/moe/moe_align_sum_kernels.cu`), and the current codebase contains no kernel instantiation for larger `topk` values, so the custom kernel's performance benefit is not leveraged for `topk > 4`.
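To make the dispatch concrete, here is a paraphrased sketch of that pattern. It is not a verbatim copy of `moe_sum`; the `VLLM_DISPATCH_FLOATING_TYPES` macro and the exact launch configuration are assumed from vLLM's dispatch utilities:

```cpp
#include <torch/all.h>
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream
#include <algorithm>

// Paraphrased sketch of the dispatch in moe_sum
// (csrc/moe/moe_align_sum_kernels.cu).
// input:  [num_tokens, topk, hidden_size]
// output: [num_tokens, hidden_size]
void moe_sum(torch::Tensor& input, torch::Tensor& output) {
  const int hidden_size = input.size(-1);
  const int64_t num_tokens = output.numel() / hidden_size;
  const int topk = input.size(1);

  dim3 grid(num_tokens);                    // one thread block per token
  dim3 block(std::min(hidden_size, 1024));  // threads stride over hidden dim
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  switch (topk) {
    case 2:
      // VLLM_DISPATCH_FLOATING_TYPES: assumed from vLLM's dispatch_utils.h
      VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "moe_sum_kernel", [&] {
        vllm::moe::moe_sum_kernel<scalar_t, 2><<<grid, block, 0, stream>>>(
            output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),
            hidden_size);
      });
      break;
    // case 3 and case 4 are analogous, instantiating the kernel with
    // TOPK = 3 and TOPK = 4.
    default:
      // Every other topk (including your topk = 8) takes this path:
      at::sum_out(output, input, 1);  // generic reduction over the topk axis
      break;
  }
}
```

Because `TOPK` is a compile-time template parameter, the per-token reduction loop can be fully unrolled, which is plausibly where most of the speedup over the generic reduction comes from.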

If you want to use the faster kernel for larger `topk`, you would need to extend the kernel's template instantiations and the dispatch logic to cover those values; as of now, that is neither implemented nor documented. A rough sketch of what the extension might look like is below.
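This sketch assumes the kernel is templated on `TOPK`, as the 2/3/4 specializations suggest. Everything below is hypothetical and untested: the kernel body, the fp32 accumulator, and the `case 8` arm are my assumptions, and any benefit at `topk = 8` would need to be confirmed with a benchmark like the one behind your 3x number:

```cpp
#include <cstdint>

// Hypothetical TOPK-templated reduction kernel (untested sketch).
// out:   [num_tokens, d]        -- per-token sums
// input: [num_tokens, TOPK, d]  -- per-expert partial outputs
template <typename scalar_t, int TOPK>
__global__ void moe_sum_kernel(scalar_t* __restrict__ out,
                               const scalar_t* __restrict__ input,
                               const int d) {
  const int64_t token = blockIdx.x;  // one block per token
  for (int64_t i = threadIdx.x; i < d; i += blockDim.x) {
    float acc = 0.0f;  // assumption: accumulate in fp32 for bf16/fp16 inputs
#pragma unroll
    for (int k = 0; k < TOPK; ++k) {  // fully unrolled at compile time
      acc += static_cast<float>(input[token * TOPK * d + k * d + i]);
    }
    out[token * d + i] = static_cast<scalar_t>(acc);
  }
}

// Hypothetical extra arm for the switch in moe_sum, so that topk = 8
// also reaches the specialized kernel instead of at::sum_out:
//   case 8:
//     VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "moe_sum_kernel", [&] {
//       vllm::moe::moe_sum_kernel<scalar_t, 8><<<grid, block, 0, stream>>>(
//           output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),
//           hidden_size);
//     });
//     break;
```

On ROCm the same source builds through HIP, so the extension itself is mechanical; the open question is whether the unrolled kernel still beats `at::sum_out` at larger `TOPK` on your hardware.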
