Hello, may I ask if there are any plans for vllm-ascend to support rerank (bge-reranker-v2-m3) and embedding (bge-m3) models?
RunLLM
October 16, 2025, 5:13am
Yes, vLLM already supports both rerank (e.g., BAAI/bge-reranker-v2-m3) and embedding (e.g., BAAI/bge-m3) models on its mainline, including via OpenAI-compatible endpoints. You can serve these models with commands like `vllm serve BAAI/bge-reranker-v2-m3` or `vllm serve BAAI/bge-m3`, and use the `/score`, `/rerank`, or `/embeddings` APIs as documented. However, the provided data sources make no explicit mention of vllm-ascend (Ascend hardware/NPU) support for these models; the current documentation and issues cover only CPU and GPU backends. If you are referring to Huawei Ascend NPUs, they are not listed as supported in the official docs or issues.
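For reference, calling the `/rerank` endpoint from a client could look like the sketch below. This assumes a server started with `vllm serve BAAI/bge-reranker-v2-m3` on the default port 8000; see the vLLM docs for the exact request and response schema.

```python
# Minimal client sketch (assumption: server started with
# `vllm serve BAAI/bge-reranker-v2-m3`, listening on localhost:8000).
import requests

resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
# Each result references an index into `documents` with a relevance score.
print(resp.json())
```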
Would you like more detail on how to use these models with vLLM on supported hardware, or are you specifically interested in Ascend/NPU support?
Hint: Mention @RunLLM in the post for followups.
yikun
October 16, 2025, 12:23pm
PR: lianyiibo:pooling_support → main (opened 07:16AM, 23 Sep 2025 UTC)
### What this PR does / why we need it?
Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend.
After this [commit](https://github.com/vllm-project/vllm/commit/17373dcd93ca60554d72cef4e159e70abbfd15af), vllm has provided support for adapting pooling models on the v1 engine. This PR includes corresponding adaptations on the vllm-ascend side.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Taking `BAAI/bge-reranker-v2-m3` as an example.
```python
from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser


def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
    )
    return parser.parse_args()


def main(args: Namespace):
    # Sample prompts.
    text_1 = "What is the capital of France?"
    texts_2 = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    llm = LLM(**vars(args))

    # Generate scores. The output is a list of ScoringRequestOutputs.
    outputs = llm.score(text_1, texts_2)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for text_2, output in zip(texts_2, outputs):
        score = output.outputs.score
        print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
        print("-" * 60)


if __name__ == "__main__":
    args = parse_args()
    main(args)
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
PR: JC-ut0:feat_bert_v1 → main (opened 03:24AM, 25 Sep 2025 UTC)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_bge_model.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a sequence length of 8192 per request, we set `--max-num-batched-tokens=8192*10` (81920) so the encoder input is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a sequence length of 8192 per request, QPS improves from 85 to 104, a 22% improvement; much of the host-bound overhead is eliminated.
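As an illustration (not part of the PR), querying such a server through the OpenAI-compatible embeddings endpoint could look like this sketch, assuming the port 9095 and served model name "bge-m3" from the command above, and the `openai` Python client installed:

```python
# Minimal client sketch (assumptions: server from the command above on
# port 9095, served model name "bge-m3", `openai` Python package available).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9095/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="bge-m3",
    input=[
        "What is the capital of France?",
        "The capital of France is Paris.",
    ],
)
# One embedding vector is returned per input string.
print(len(resp.data), len(resp.data[0].embedding))
```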
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
See the related PRs above ^. V1 support for embedding models will be ready soon (maybe v0.11.1rc1).