@RunLLM, does the latest version of vLLM support sparse embeddings when I serve BGE-M3 with it?
Yes, the latest version of vLLM supports sparse embeddings for BGE-M3. To enable this, you must override the architecture when serving the model:
vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
Then you can obtain sparse embeddings by sending a request with "task": "token_classify" to the /pooling endpoint. Note that the output is a list of token scores only, with no token text; if you need the token text as well, call /tokenize on the same input and pair the tokens with their scores. See the official docs and the implementation PR listed under Sources.
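As a rough illustration of the request-and-pair flow described above, here is a minimal Python sketch. It rests on several assumptions not stated in this thread: the server listens on http://localhost:8000, the /pooling response nests scores under data[0]["data"], /tokenize accepts a "prompt" field and returns token ids under "tokens", and both endpoints treat special tokens the same way. The helper names (post_json, pooling_payload, pair_tokens_with_scores, sparse_embedding) are hypothetical. Keeping the maximum score for a repeated token follows the usual BGE-M3 lexical-weight convention, not anything vLLM does for you.

```python
import json
from urllib import request

# Assumptions: default `vllm serve` address and the model name from the command above.
BASE_URL = "http://localhost:8000"
MODEL = "BAAI/bge-m3"


def post_json(path, payload):
    """POST a JSON payload to the server and return the decoded JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def pooling_payload(text):
    """Build the /pooling request body that asks for per-token scores."""
    return {"model": MODEL, "task": "token_classify", "input": [text]}


def pair_tokens_with_scores(tokens, scores):
    """Map each token to its score, keeping the max score for repeated tokens.

    Accepts scores as a flat list [s, ...] or as one-element lists [[s], ...],
    since the exact response shape is an assumption here.
    """
    flat = [s[0] if isinstance(s, list) else s for s in scores]
    if len(tokens) != len(flat):
        # Likely a special-token mismatch between /tokenize and /pooling.
        raise ValueError(f"{len(tokens)} tokens but {len(flat)} scores")
    weights = {}
    for tok, score in zip(tokens, flat):
        weights[tok] = max(score, weights.get(tok, score))
    return weights


def sparse_embedding(text):
    """Fetch token scores and token ids, then pair them (response keys assumed)."""
    pooled = post_json("/pooling", pooling_payload(text))
    scores = pooled["data"][0]["data"]
    tokenized = post_json("/tokenize", {"model": MODEL, "prompt": text})
    return pair_tokens_with_scores(tokenized["tokens"], scores)
```

With the server running, sparse_embedding("What is BGE M3?") would return a dict keyed by token id. If the lengths differ, the ValueError suggests that the two endpoints disagree on special tokens, and the alignment needs to be adjusted for your vLLM version.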
Would you like more detail or example code?
Sources:
- https://docs.vllm.ai/en/latest/models/pooling_models/specific_models/#__span-15-3
- https://github.com/vllm-project/vllm/pull/14526
@RunLLM Can I pass --pooler-config.task embed&token_classify to enable both tasks at once?