Hello, may I ask if there are any plans for vllm-ascend to support rerank (bge-reranker-v2-m3) and embedding (bge-m3) models?
RunLLM
October 16, 2025, 5:13am
Yes, vLLM already supports both rerank (e.g., BAAI/bge-reranker-v2-m3) and embedding (e.g., BAAI/bge-m3) models on its mainline, including via OpenAI-compatible endpoints. You can serve these models with commands like `vllm serve BAAI/bge-reranker-v2-m3` or `vllm serve BAAI/bge-m3`, and use the `/score`, `/rerank`, or `/embeddings` APIs as documented. However, the provided data sources make no explicit mention of vllm-ascend (Ascend hardware/NPU) support for these models; the current documentation and issues cover only CPU and GPU backends. If you are referring to Huawei Ascend NPUs, they are not listed as supported in the official docs or issues.
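For reference, calling the `/rerank` endpoint from a client could look like the sketch below. This assumes a server started with `vllm serve BAAI/bge-reranker-v2-m3` on the default port 8000; see the vLLM docs for the exact request and response schema.

```python
# Minimal client sketch (assumption: server started with
# `vllm serve BAAI/bge-reranker-v2-m3`, listening on localhost:8000).
import requests

resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
# Each result references an index into `documents` with a relevance score.
print(resp.json())
```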
Would you like more detail on how to use these models with vLLM on supported hardware, or are you specifically interested in Ascend/NPU support?
Hint: Mention @RunLLM in the post for followups.
yikun
October 16, 2025, 12:23pm
PR: lianyiibo:pooling_support → main (opened 07:16AM, 23 Sep 2025 UTC)
### What this PR does / why we need it?
Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend.
After this [commit](https://github.com/vllm-project/vllm/commit/17373dcd93ca60554d72cef4e159e70abbfd15af), vllm has provided support for adapting pooling models on the v1 engine. This PR includes corresponding adaptations on the vllm-ascend side.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Taking `BAAI/bge-reranker-v2-m3` as an example.
```python
from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser


def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
    )
    return parser.parse_args()


def main(args: Namespace):
    # Sample prompts.
    text_1 = "What is the capital of France?"
    texts_2 = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    llm = LLM(**vars(args))

    # Generate scores. The output is a list of ScoringRequestOutputs.
    outputs = llm.score(text_1, texts_2)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for text_2, output in zip(texts_2, outputs):
        score = output.outputs.score
        print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
        print("-" * 60)


if __name__ == "__main__":
    args = parse_args()
    main(args)
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
PR: JC-ut0:feat_bert_v1 → main (opened 03:24AM, 25 Sep 2025 UTC)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_bge_model.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a sequence length of 8192 per request, we set `--max-num-batched-tokens=8192*10` (81920) so the encoder input is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a sequence length of 8192 per request, QPS improves from 85 to 104, a 22% improvement; much of the host-bound overhead is eliminated.
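As an illustration (not part of the PR), querying such a server through the OpenAI-compatible embeddings endpoint could look like this sketch, assuming the port 9095 and served model name "bge-m3" from the command above, and the `openai` Python client installed:

```python
# Minimal client sketch (assumptions: server from the command above on
# port 9095, served model name "bge-m3", `openai` Python package available).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9095/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="bge-m3",
    input=[
        "What is the capital of France?",
        "The capital of France is Paris.",
    ],
)
# One embedding vector is returned per input string.
print(len(resp.data), len(resp.data[0].embedding))
```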
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
See the related PRs above ^. V1 support for embedding models will be ready soon (maybe v0.11.1rc1).