Mixedbread reranker on vLLM `/score`: scores differ vs local Mixedbread; small payload = same order/different scores, large payload = different order

I’m serving the Mixedbread v2 reranker as a Qwen2 sequence-classification model via `--hf_overrides` and calling `/score`:

```bash
vllm serve mixedbread-ai/mxbai-rerank-base-v2 \
  --hf_overrides '{"architectures":["Qwen2ForSequenceClassification"],"classifier_from_token":["0","1"],"method":"from_2_way_softmax"}' \
  --host 0.0.0.0 --port 8000
```

Env: AWS g6e.xlarge (L40S 48 GB).
Reference: the local Mixedbread library (`from mxbai_rerank import MxbaiRerankV2`) on a Mac M3 Pro (CPU).
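
The vLLM numbers below come from plain `/score` requests, roughly like this (a sketch of the request shape only; the document strings are placeholders and no extra templating is applied):

```python
import requests

VLLM_URL = "http://localhost:8000/score"  # adjust host/port to the deployment above

query = "Who wrote 'To Kill a Mockingbird'?"
docs = [  # placeholder documents; the real payload uses the docs from the comparison below
    "Harper Lee wrote the novel 'To Kill a Mockingbird', published in 1960.",
    "Jane Austen is best known for 'Pride and Prejudice'.",
]

# /score takes one text_1 (the query) and a list of text_2 (the documents);
# the response contains one score per (text_1, text_2) pair.
resp = requests.post(
    VLLM_URL,
    json={"model": "mixedbread-ai/mxbai-rerank-base-v2", "text_1": query, "text_2": docs},
)
resp.raise_for_status()

for doc, item in zip(docs, resp.json()["data"]):
    print(f"{item['score']:.7f}  {doc[:60]}")
```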

Key observations:

  • Scores differ between vLLM `/score` and the local Mixedbread library.
  • Small payload (“To Kill a Mockingbird”): same ranking, but vLLM assigns much higher probabilities to the negative documents than the library does.
  • Larger payload (10 docs, ~1K chars each; query “Who wrote the novel 1984?”): the ranking order diverges from the library’s.
Small payload — side-by-side (vLLM `/score` vs MxbaiRerankV2):

| Doc (short)        | vLLM `/score` | MxbaiRerankV2 |
|--------------------|---------------|---------------|
| Harper Lee (novel) | 0.9974413     | 0.9941708     |
| Harper Lee (bio)   | 0.9837090     | 0.9705565     |
| Jane Austen        | 0.9364685     | 0.2971900     |
| Harry Potter       | 0.7970418     | 0.0699584     |
| Moby-Dick          | 0.6106243     | 0.0519386     |
| Gatsby             | 0.5890393     | 0.0290230     |

Large payload (~1K chars/doc, “1984”) — order diverges

Ask: what’s the recommended way to get score parity and stable order (label tokens, pair template, calibration) so vLLM /score matches the Mixedbread library?

For Mixedbread reranker models (e.g. mixedbread-ai/mxbai-rerank-base-v2) served via vLLM with --hf_overrides '{"architectures":["Qwen2ForSequenceClassification"],"classifier_from_token":["0","1"],"method":"from_2_way_softmax"}', some difference in raw score values compared to the original Mixedbread library is expected; it comes from differences in prompt formatting, tokenization, and possibly numerical precision or batching behavior.

The divergence in ranking order, especially on longer documents, is usually a prompt-formatting issue: the model expects the same query/document templates the original library applies. Without them, vLLM scores and rankings can be unstable or incorrect. See the official vLLM usage example for the correct templates and formatting.

To maximize parity:

  • Use the provided query_template and document_template to format both the query and the documents before sending them to /score (see the sketch after this list).
  • Ensure tokenization and truncation settings match those of the reference implementation.
  • Accept that minor score differences may persist due to hardware, precision, or batching, but ranking should be stable if templates are used.
  • For large payloads, ensure that document length does not exceed model limits, and consider chunking if needed.
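
A minimal end-to-end sketch of the template-based flow follows. The QUERY_TEMPLATE and DOC_TEMPLATE strings are placeholders, not the real templates: copy the exact strings from the official vLLM example for mxbai-rerank-v2 (or from the mxbai_rerank source), since parity depends on matching them character for character. The local-reference helper assumes the rank() result objects expose .index and .score as in the Mixedbread model card example.

```python
import requests
from mxbai_rerank import MxbaiRerankV2  # local reference implementation

VLLM_URL = "http://localhost:8000/score"
MODEL = "mixedbread-ai/mxbai-rerank-base-v2"

# PLACEHOLDER templates -- replace with the exact strings from the official
# vLLM mxbai-rerank example; parity requires matching them verbatim.
QUERY_TEMPLATE = "query: {query}"
DOC_TEMPLATE = "document: {doc}"


def vllm_scores(query: str, docs: list[str]) -> list[float]:
    """Score (query, doc) pairs via vLLM /score, applying the templates client-side."""
    resp = requests.post(
        VLLM_URL,
        json={
            "model": MODEL,
            "text_1": QUERY_TEMPLATE.format(query=query),
            "text_2": [DOC_TEMPLATE.format(doc=d) for d in docs],
        },
    )
    resp.raise_for_status()
    return [item["score"] for item in resp.json()["data"]]


def local_scores(query: str, docs: list[str]) -> list[float]:
    """Reference scores from the Mixedbread library (it applies its templates internally)."""
    reranker = MxbaiRerankV2(MODEL)
    results = reranker.rank(query, docs, return_documents=False, top_k=len(docs))
    # rank() returns results sorted by score; map them back to input order
    # (assumes result objects carry .index and .score, as in the model card).
    by_index = {r.index: r.score for r in results}
    return [by_index[i] for i in range(len(docs))]


if __name__ == "__main__":
    query = "Who wrote the novel 1984?"
    docs = [  # placeholder documents
        "George Orwell wrote the dystopian novel 1984, published in 1949.",
        "Jane Austen wrote Emma and Pride and Prejudice.",
    ]
    for doc, v, loc in zip(docs, vllm_scores(query, docs), local_scores(query, docs)):
        print(f"vLLM={v:.5f}  local={loc:.5f}  {doc[:50]}")
```

If the two sides still disagree on long documents after templating, check truncation first: make sure both sides see the same (templated) text within the model’s context limit, chunking per the last bullet above if necessary.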
