Mixedbread reranker on vLLM `/score`: scores differ vs local Mixedbread; small payload = same order/different scores, large payload = different order

I’m serving the Mixedbread v2 reranker as a Qwen2 sequence-classification model via `--hf_overrides` and calling `/score`:

```bash
vllm serve mixedbread-ai/mxbai-rerank-base-v2 \
  --hf_overrides '{"architectures":["Qwen2ForSequenceClassification"],"classifier_from_token":["0","1"],"method":"from_2_way_softmax"}' \
  --host 0.0.0.0 --port 8000
```

Env: AWS g6e.xlarge (L40S 48 GB).
Reference: the local Mixedbread library (`from mxbai_rerank import MxbaiRerankV2`) on a Mac M3 Pro (CPU).
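
The vLLM numbers below come from plain `/score` requests, roughly like this (a sketch of the request shape only; the document strings are placeholders and no extra templating is applied):

```python
import requests

VLLM_URL = "http://localhost:8000/score"  # adjust host/port to the deployment above

query = "Who wrote 'To Kill a Mockingbird'?"
docs = [  # placeholder documents; the real payload uses the docs from the comparison below
    "Harper Lee wrote the novel 'To Kill a Mockingbird', published in 1960.",
    "Jane Austen is best known for 'Pride and Prejudice'.",
]

# /score takes one text_1 (the query) and a list of text_2 (the documents);
# the response contains one score per (text_1, text_2) pair.
resp = requests.post(
    VLLM_URL,
    json={"model": "mixedbread-ai/mxbai-rerank-base-v2", "text_1": query, "text_2": docs},
)
resp.raise_for_status()

for doc, item in zip(docs, resp.json()["data"]):
    print(f"{item['score']:.7f}  {doc[:60]}")
```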

Key observations:

  • Scores differ between vLLM `/score` and the local Mixedbread library.
  • Small payload (“To Kill a Mockingbird”): same ranking, but vLLM assigns much higher probabilities to the negative documents than the library does.
  • Larger payload (10 docs, ~1K chars each; query “Who wrote the novel 1984?”): the ranking order diverges from the library’s.
Small payload — side-by-side (vLLM `/score` vs MxbaiRerankV2):

| Doc (short)        | vLLM `/score` | MxbaiRerankV2 |
|--------------------|---------------|---------------|
| Harper Lee (novel) | 0.9974413     | 0.9941708     |
| Harper Lee (bio)   | 0.9837090     | 0.9705565     |
| Jane Austen        | 0.9364685     | 0.2971900     |
| Harry Potter       | 0.7970418     | 0.0699584     |
| Moby-Dick          | 0.6106243     | 0.0519386     |
| Gatsby             | 0.5890393     | 0.0290230     |

Large payload (~1K chars/doc, “1984”) — order diverges

Ask: what’s the recommended way to get score parity and stable order (label tokens, pair template, calibration) so vLLM /score matches the Mixedbread library?

For Mixedbread reranker models (e.g. mixedbread-ai/mxbai-rerank-base-v2) served via vLLM with --hf_overrides '{"architectures":["Qwen2ForSequenceClassification"],"classifier_from_token":["0","1"],"method":"from_2_way_softmax"}', some difference in raw score values compared to the original Mixedbread library is expected; it comes from differences in prompt formatting, tokenization, and possibly numerical precision or batching behavior.

The divergence in ranking order, especially on longer documents, is usually a prompt-formatting issue: the model expects the same query/document templates the original library applies. Without them, vLLM scores and rankings can be unstable or incorrect. See the official vLLM usage example for the correct templates and formatting.

To maximize parity:

  • Use the provided query_template and document_template to format both the query and the documents before sending them to /score (see the sketch after this list).
  • Ensure tokenization and truncation settings match those of the reference implementation.
  • Accept that minor score differences may persist due to hardware, precision, or batching, but ranking should be stable if templates are used.
  • For large payloads, ensure that document length does not exceed model limits, and consider chunking if needed.
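
A minimal end-to-end sketch of the template-based flow follows. The QUERY_TEMPLATE and DOC_TEMPLATE strings are placeholders, not the real templates: copy the exact strings from the official vLLM example for mxbai-rerank-v2 (or from the mxbai_rerank source), since parity depends on matching them character for character. The local-reference helper assumes the rank() result objects expose .index and .score as in the Mixedbread model card example.

```python
import requests
from mxbai_rerank import MxbaiRerankV2  # local reference implementation

VLLM_URL = "http://localhost:8000/score"
MODEL = "mixedbread-ai/mxbai-rerank-base-v2"

# PLACEHOLDER templates -- replace with the exact strings from the official
# vLLM mxbai-rerank example; parity requires matching them verbatim.
QUERY_TEMPLATE = "query: {query}"
DOC_TEMPLATE = "document: {doc}"


def vllm_scores(query: str, docs: list[str]) -> list[float]:
    """Score (query, doc) pairs via vLLM /score, applying the templates client-side."""
    resp = requests.post(
        VLLM_URL,
        json={
            "model": MODEL,
            "text_1": QUERY_TEMPLATE.format(query=query),
            "text_2": [DOC_TEMPLATE.format(doc=d) for d in docs],
        },
    )
    resp.raise_for_status()
    return [item["score"] for item in resp.json()["data"]]


def local_scores(query: str, docs: list[str]) -> list[float]:
    """Reference scores from the Mixedbread library (it applies its templates internally)."""
    reranker = MxbaiRerankV2(MODEL)
    results = reranker.rank(query, docs, return_documents=False, top_k=len(docs))
    # rank() returns results sorted by score; map them back to input order
    # (assumes result objects carry .index and .score, as in the model card).
    by_index = {r.index: r.score for r in results}
    return [by_index[i] for i in range(len(docs))]


if __name__ == "__main__":
    query = "Who wrote the novel 1984?"
    docs = [  # placeholder documents
        "George Orwell wrote the dystopian novel 1984, published in 1949.",
        "Jane Austen wrote Emma and Pride and Prejudice.",
    ]
    for doc, v, loc in zip(docs, vllm_scores(query, docs), local_scores(query, docs)):
        print(f"vLLM={v:.5f}  local={loc:.5f}  {doc[:50]}")
```

If the two sides still disagree on long documents after templating, check truncation first: make sure both sides see the same (templated) text within the model’s context limit, chunking per the last bullet above if necessary.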
