Sparse Embedding Support

Ranjith · June 24, 2026, 7:22am

@RunLLM, does the latest version of vLLM support sparse embeddings when I serve BGE-M3 with it?

RunLLM · June 24, 2026, 7:22am

Yes, the latest version of vLLM supports sparse embeddings for BGE-M3. To enable this, you must override the architecture when serving the model:

vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

Then request sparse embeddings by sending "task": "token_classify" to the /pooling endpoint. The output is a list of per-token scores for each input; if you also need the token text, call /tokenize and pair the tokens with their scores. See official docs and implementation details.
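As a rough illustration of the /pooling plus /tokenize pairing described above, here is a minimal Python sketch using only the standard library. Several things in it are assumptions rather than confirmed behavior: the server address (localhost:8000), the response layout (scores under data[0]["data"], token ids under "tokens"), the "prompt" field for /tokenize, and that both endpoints treat special tokens the same way so the lengths line up. The helper names are hypothetical. Keeping the largest weight for a repeated token is also a choice made for this sketch, not something the thread specifies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default `vllm serve` address
MODEL = "BAAI/bge-m3"


def post_json(path, payload):
    """POST a JSON payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def to_scalar(score):
    """Accept either a bare number or a one-element list such as [0.31]."""
    return float(score[0]) if isinstance(score, (list, tuple)) else float(score)


def pair_tokens_with_scores(token_ids, scores):
    """Build {token_id: weight}, keeping the largest positive weight per token."""
    if len(token_ids) != len(scores):
        # Usually means one endpoint added special tokens and the other did not.
        raise ValueError("token and score counts differ")
    sparse = {}
    for token_id, score in zip(token_ids, scores):
        weight = to_scalar(score)
        if weight > sparse.get(token_id, 0.0):
            sparse[token_id] = weight
    return sparse


def sparse_embedding(text):
    """Fetch per-token scores and token ids, then pair them (assumed schemas)."""
    pooled = post_json(
        "/pooling",
        {"model": MODEL, "task": "token_classify", "input": [text]},
    )
    scores = pooled["data"][0]["data"]  # assumed response layout
    tokenized = post_json("/tokenize", {"model": MODEL, "prompt": text})
    return pair_tokens_with_scores(tokenized["tokens"], scores)
```

With the server from the command above running, sparse_embedding("What is BGE M3?") should return a dictionary mapping token ids to weights. If the length check fails, compare how the two endpoints handle special tokens before trusting the pairing.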

Would you like more detail or example code?


Ranjith · June 24, 2026, 11:10am

@RunLLM Can I do --pooler-config.task embed&token_classify?
