Question about vllm-ascend performance on server with 8*910B3

Environment Information is as follows:

OS: Kylin10
Ascend NPU Driver: Ascend-hdk-910b-npu-driver_23.0.7_linux-aarch64.run
Ascend NPU Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.11.220.run
Ascend Docker Runtime: Ascend-docker-runtime_5.0.RC3.2_linux-x86_64.run
Docker: docker-ce-26.1.3-1.el8.aarch64.rpm
Containerd: containerd.io-1.6.32-3.1.el8.aarch64.rpm
vllm-ascend: vllm-ascend-v0.8.5rc1

Run 32B model as follows:

export IMAGE=quay.io/ascend/vllm-ascend:v0.8.5rc1
docker run --rm \
--name vllm-ascend-env \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-v /home/test:/mnt \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve /mnt/Qwen3-32B --tensor-parallel-size 4 --max-model-len 16384 --gpu-memory-utilization 0.85

The API request results show a generation speed of only about 10 tokens/s.

Question:

Is the above performance normal? How can I improve it?
For comparison, running a 32B model on an Nvidia A10 with vLLM 0.8.5rc1 reaches about 25 tokens/s, and as far as I know the 910B3 should not be slower than an A10.


Based on the available data, 10 tokens/s for a Qwen3-32B model on Ascend 910B3 NPUs is lower than expected, especially since similar or better performance is reported on Nvidia A10 GPUs (25 tokens/s) with vLLM 0.8.5rc1. The vLLM-Ascend backend is still under active development, and performance tuning for Ascend NPUs may not be as mature as for Nvidia GPUs. There are no official benchmarks for 32B models on 910B3 in the provided sources, but users have reported that vLLM-Ascend performance is not yet on par with Nvidia GPUs and that multi-card (tensor parallel) support is still a work in progress (vllm-ascend PR, vllm-ascend repo).

To potentially improve performance, ensure you are using the latest vllm-ascend code, and check for any updates or tuning guides in the vllm-ascend repository. Also, verify that your environment variables and device bindings match the recommendations in the official documentation and PR discussions. If issues persist, consider opening an issue in the vllm-ascend repo with your environment details and logs for targeted support.
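For example, you can verify from the host that the NPUs you intend to pass through are visible and healthy before starting the container. npu-smi is part of the standard Ascend driver installation; the exact output columns vary by driver version:

npu-smi info

All four davinci devices mapped into the container should report a normal health status and low HBM usage before the serve process starts.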

Would you like a step-by-step breakdown of possible tuning steps or troubleshooting for Ascend NPU performance?


Thanks for the reply.

What is the normal performance for the Qwen3-32B model on Ascend 910B3 NPUs? I would appreciate some step-by-step tuning advice for improving the performance, thanks very much.

There appear to be no official or community-published benchmarks for Qwen3-32B on Ascend 910B3 NPUs in the provided sources, so "normal" performance is not clearly defined. Most reports indicate that vLLM-Ascend is still catching up to Nvidia GPU performance, and multi-card scaling is under development (vllm-ascend PR, vllm-ascend repo). Users have not shared specific tokens/s numbers for the 910B3, but your observed 10 tokens/s is low compared to Nvidia A10 results.

Here are step-by-step tuning suggestions based on available documentation and user feedback; a combined example command is shown after the list:

  1. Update vllm-ascend: Use the latest code from vllm-ascend, as performance improvements are ongoing.
  2. Tensor Parallel Size: Ensure --tensor-parallel-size matches your available NPUs and is supported by your hardware.
  3. Memory Utilization: Adjust --gpu-memory-utilization (e.g., try 0.9 or 0.95) to maximize NPU memory usage.
  4. Batch Size: Increase --max-num-seqs if memory allows, as higher batch sizes can improve throughput.
  5. Model Context Length: Lower --max-model-len if you do not need long contexts, as this reduces memory pressure.
  6. Ascend Environment Variables: Set PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 and other recommended variables from the PR discussion.
  7. Profile and Monitor: Use npu-smi and logs to check NPU utilization and memory usage.
  8. Report Issues: If performance is still low, open an issue in the vllm-ascend repo with detailed logs and environment info for targeted help.
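
As a combined illustration of steps 2 through 6, here is a hedged sketch of a tuned launch inside the container; the specific values (context length, batch size, memory utilization) are starting points to experiment with, not validated settings for Qwen3-32B on 910B3:

export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve /mnt/Qwen3-32B \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.9

While a benchmark request is running, watch -n 1 npu-smi info on the host should show all four NPUs busy and HBM usage close to the configured utilization target; if only one card is active, the tensor-parallel setup needs attention.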

Would you like more detail on any of these steps?


@kyle Thanks for your info on Qwen3.


There are no official Qwen3-32B benchmarks for Ascend 910B3, but vLLM-Ascend v0.7.3-dev adds Qwen3 support and includes a performance tuning guide. For v0.8/0.9, graph mode optimizations are in progress and will be released soon (issue #767). You can try the latest dev images for better performance.
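
If you want to try the newer code, you can point the same docker run command shown earlier at a development image instead of v0.8.5rc1. The tag below is a placeholder, not a real tag name; check the vllm-ascend repository or the quay.io/ascend/vllm-ascend registry for the current development tag:

export IMAGE=quay.io/ascend/vllm-ascend:<dev-tag>

Then rerun the container and serve command unchanged and compare the tokens/s against your current baseline.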

Would you like a step-by-step summary of the optimization guide for vLLM-Ascend?
