Question about vllm-ascend performance on server with 8*910B3

Environment Information is as follows:

OS: Kylin10
Ascend NPU Driver: Ascend-hdk-910b-npu-driver_23.0.7_linux-aarch64.run
Ascend NPU Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.11.220.run
Ascend Docker Runtime: Ascend-docker-runtime_5.0.RC3.2_linux-x86_64.run
Docker: docker-ce-26.1.3-1.el8.aarch64.rpm
Containerd: containerd.io-1.6.32-3.1.el8.aarch64.rpm
vllm-ascend: vllm-ascend-v0.8.5rc1

Run 32B model as follows:

export IMAGE=quay.io/ascend/vllm-ascend:v0.8.5rc1
docker run --rm \
--name vllm-ascend-env \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-v /home/test:/mnt \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve /mnt/Qwen3-32B --tensor-parallel-size 4 --max-model-len 16384 --gpu-memory-utilization 0.85

The API request results show a generation speed of only about 10 tokens/s.

Question:

Is the above performance normal? How can I improve it?
For comparison, running a 32B model on an Nvidia A10 with vLLM 0.8.5rc1 reaches about 25 tokens/s, and as far as I know the 910B3 should not be slower than an A10.


Based on the available data, 10 tokens/s for a Qwen3-32B model on Ascend 910B3 NPUs is lower than expected, especially since similar or better performance is reported on Nvidia A10 GPUs (25 tokens/s) with vLLM 0.8.5rc1. The vLLM-Ascend backend is still under active development, and performance tuning for Ascend NPUs may not be as mature as for Nvidia GPUs. There are no official benchmarks for 32B models on 910B3 in the provided sources, but users have reported that vLLM-Ascend performance is not yet on par with Nvidia GPUs and that multi-card (tensor parallel) support is still a work in progress (vllm-ascend PR, vllm-ascend repo).

To potentially improve performance, ensure you are using the latest vllm-ascend code, and check for any updates or tuning guides in the vllm-ascend repository. Also, verify that your environment variables and device bindings match the recommendations in the official documentation and PR discussions. If issues persist, consider opening an issue in the vllm-ascend repo with your environment details and logs for targeted support.
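For example, you can verify from the host that the NPUs you intend to pass through are visible and healthy before starting the container. npu-smi is part of the standard Ascend driver installation; the exact output columns vary by driver version:

npu-smi info

All four davinci devices mapped into the container should report a normal health status and low HBM usage before the serve process starts.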

Would you like a step-by-step breakdown of possible tuning steps or troubleshooting for Ascend NPU performance?


Thanks for the reply.

What is the normal performance for the Qwen3-32B model on Ascend 910B3 NPUs? I would appreciate some step-by-step tuning advice for improving the performance, thanks very much.

There appear to be no official or community-published benchmarks for Qwen3-32B on Ascend 910B3 NPUs in the provided sources, so "normal" performance is not clearly defined. Most reports indicate that vLLM-Ascend is still catching up to Nvidia GPU performance, and multi-card scaling is under development (vllm-ascend PR, vllm-ascend repo). Users have not shared specific tokens/s numbers for the 910B3, but your observed 10 tokens/s is low compared to Nvidia A10 results.

Here are step-by-step tuning suggestions based on available documentation and user feedback; a combined example command is shown after the list:

  1. Update vllm-ascend: Use the latest code from vllm-ascend, as performance improvements are ongoing.
  2. Tensor Parallel Size: Ensure --tensor-parallel-size matches your available NPUs and is supported by your hardware.
  3. Memory Utilization: Adjust --gpu-memory-utilization (e.g., try 0.9 or 0.95) to maximize NPU memory usage.
  4. Batch Size: Increase --max-num-seqs if memory allows, as higher batch sizes can improve throughput.
  5. Model Context Length: Lower --max-model-len if you do not need long contexts, as this reduces memory pressure.
  6. Ascend Environment Variables: Set PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 and other recommended variables from the PR discussion.
  7. Profile and Monitor: Use npu-smi and logs to check NPU utilization and memory usage.
  8. Report Issues: If performance is still low, open an issue in the vllm-ascend repo with detailed logs and environment info for targeted help.
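
As a combined illustration of steps 2 through 6, here is a hedged sketch of a tuned launch inside the container; the specific values (context length, batch size, memory utilization) are starting points to experiment with, not validated settings for Qwen3-32B on 910B3:

export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve /mnt/Qwen3-32B \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.9

While a benchmark request is running, watch -n 1 npu-smi info on the host should show all four NPUs busy and HBM usage close to the configured utilization target; if only one card is active, the tensor-parallel setup needs attention.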

Would you like more detail on any of these steps?


@kyle Thanks for your info on Qwen3.


There are no official Qwen3-32B benchmarks for Ascend 910B3, but vLLM-Ascend v0.7.3-dev adds Qwen3 support and includes a performance tuning guide. For v0.8/0.9, graph mode optimizations are in progress and will be released soon (issue #767). You can try the latest dev images for better performance.
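
If you want to try the newer code, you can point the same docker run command shown earlier at a development image instead of v0.8.5rc1. The tag below is a placeholder, not a real tag name; check the vllm-ascend repository or the quay.io/ascend/vllm-ascend registry for the current development tag:

export IMAGE=quay.io/ascend/vllm-ascend:<dev-tag>

Then rerun the container and serve command unchanged and compare the tokens/s against your current baseline.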

Would you like a step-by-step summary of the optimization guide for vLLM-Ascend?
