Token generation rate on NPU 310P3

XiaoDouGeGe · December 2, 2025, 6:54am

Which vllm-ascend version corresponds to the NPU 310P3, and what token generation rate can the inference service reach?

RunLLM · December 2, 2025, 6:54am

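It seems that vllm-ascend is currently developed and tested mainly on Ascend NPUs such as the Atlas 300T A2; the official materials do not clearly state the support status or performance figures for the 310P3. According to related PRs and discussions, the 310P series has limited operator support, so inference runs into compatibility issues and performance bottlenecks, and some users report that a number of operators are heavily restricted on the 310P. I recommend following the official Ascend documentation and the latest progress in the vllm-ascend repository.

As for token generation speed, the publicly available data is mostly based on A2 cards, and I have not seen a detailed benchmark for the 310P3. On an A2 card, the Qwen2-7B-Instruct model runs at roughly 2.56 tokens/s for input and 42.72 tokens/s for output. Treat these figures as a reference only: actual 310P3 speed may be lower, and more accurate performance data will have to wait until CANN and operator support mature. If you need 310P3 adaptation and better performance, consider contributing to community development or watching for later releases. Reference 1, Reference 2

Would you like more detailed technical information or adaptation advice?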

XiaoDouGeGe · December 2, 2025, 7:04am

My current environment is:
Image: vllm-ascend:v0.9.2rc1-310p
Compute: 310P3 48GB * 8
Model: Qwen2.5-72B-Instruct

Avg prompt throughput: 4.x tokens/s

Avg generation throughput: 6.x tokens/s

The inference service is deployed with:
export VLLM_USE_MODELSCOPE=True
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve /data-new/models/Qwen2.5-72B-Instruct \
--trust-remote-code \
--tensor-parallel-size 8 \
--enforce-eager \
--dtype float16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.8 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
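
As a rough sanity check on the memory side of this configuration, the weight footprint per device can be estimated from the numbers in the post. This is only a back-of-the-envelope sketch: it assumes a round 72 billion parameters, 2 bytes per parameter for float16, and weights sharded evenly across the 8 devices, and it ignores activations, KV cache, and runtime overhead. The function name `estimate_memory` is hypothetical and not part of vLLM or vllm-ascend.

```shell
# Rough per-device memory estimate. Inputs come from the post or are assumptions.
# Args: params (billions), bytes per param, tensor-parallel size,
#       device memory (GB), memory utilization fraction.
estimate_memory() {
  awk -v p="$1" -v b="$2" -v tp="$3" -v mem="$4" -v util="$5" 'BEGIN {
    weights = p * b / tp      # GB of weights held by each device
    budget  = mem * util      # GB the server is allowed to use per device
    printf "weights_per_device_gb=%.1f budget_gb=%.1f headroom_gb=%.1f\n",
           weights, budget, budget - weights
  }'
}

# 72B parameters, float16, 8-way tensor parallel, 48 GB devices, 0.8 utilization
estimate_memory 72 2 8 48 0.8
```

Under these assumptions, each device holds about 18 GB of weights out of a 38.4 GB budget, so the weights alone do not exhaust the memory configured by `--gpu-memory-utilization 0.8`.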

What should I adjust (image, parameters, etc.) to increase the token output speed?

XiaoDouGeGe · December 2, 2025, 7:22am

I would like more detailed adaptation advice to improve token generation efficiency.
