When running the DeepSeek model in the ROCm ecosystem with `--tensor-parallel-size=16`, I observed that for `num_tokens=4k`, `topk=8`, and `hidden_size=7K`, the `vllm::moe::moe_sum_kernel` is about three times faster than `at::sum_out`.
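For anyone who wants to reproduce the baseline side of that comparison, here is a minimal standalone libtorch sketch timing the `at::sum_out` fallback with the shapes above (the concrete sizes 4096 and 7168 are my reading of "4k" and "7K"; the custom `vllm::moe::moe_sum_kernel` is only reachable through vLLM's compiled MoE extension, so it is not timed here):

```cpp
// Hypothetical standalone program: times the at::sum_out fallback that
// moe_sum uses for topk > 4, with num_tokens=4096, topk=8, hidden_size=7168.
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  const int64_t num_tokens = 4096, topk = 8, hidden_size = 7168;
  // On ROCm builds of PyTorch, torch::kCUDA transparently maps to HIP.
  auto opts = torch::dtype(torch::kBFloat16).device(torch::kCUDA);
  auto input = torch::randn({num_tokens, topk, hidden_size}, opts);
  auto output = torch::empty({num_tokens, hidden_size}, opts);

  // Warm-up launch, then time 100 iterations of the reduction over topk.
  at::sum_out(output, input, 1);
  torch::cuda::synchronize();
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 100; ++i) {
    at::sum_out(output, input, 1);
  }
  torch::cuda::synchronize();
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "at::sum_out mean: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count() / 100
            << " ms per call\n";
  return 0;
}
```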
I’m curious why `vllm::moe::moe_sum_kernel` is only used when `topk` is 2, 3, or 4. It seems this kernel could also be beneficial for larger `topk` values. From what I can see, the dispatch is implemented in the `moe_sum` function in `vllm/csrc/moe/moe_align_sum_kernels.cu`, sketched below.
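For reference, the dispatch there looks roughly like this (a simplified paraphrase of the vLLM source, with the type dispatch and kernel-launch boilerplate elided):

```cpp
// Sketch of the dispatch in moe_sum (csrc/moe/moe_align_sum_kernels.cu):
// the custom kernel is only instantiated for topk = 2, 3, 4; every other
// value falls back to ATen's generic reduction.
void moe_sum(torch::Tensor& input,     // [num_tokens, topk, hidden_size]
             torch::Tensor& output) {  // [num_tokens, hidden_size]
  const int hidden_size = input.size(-1);
  const int num_tokens = output.numel() / hidden_size;
  const int topk = input.size(1);

  dim3 grid(num_tokens);                    // one block per token
  dim3 block(std::min(hidden_size, 1024));  // one thread per hidden element

  switch (topk) {
    case 2:
      // launches vllm::moe::moe_sum_kernel<scalar_t, 2><<<grid, block>>>(...)
      break;
    case 3:
      // launches vllm::moe::moe_sum_kernel<scalar_t, 3><<<grid, block>>>(...)
      break;
    case 4:
      // launches vllm::moe::moe_sum_kernel<scalar_t, 4><<<grid, block>>>(...)
      break;
    default:
      // Any other topk (including topk=8) takes the generic ATen path.
      at::sum_out(output, input, 1);
      break;
  }
}
```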
Here is the command I used to launch the model:
```bash
vllm serve /DeepSeek-R1-INT8 \
  --trust-remote-code \
  --distributed-executor-backend ray \
  --dtype bfloat16 \
  --max-model-len 32748 \
  -tp 16 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 128 \
  --speculative_config '{"num_speculative_tokens": 1}' \
  --block-size 64 \
  --max-seq-len-to-capture 32748 \
  --port 8000
```
Here is the link to the model: https://huggingface.co/meituan/DeepSeek-R1-Block-INT8