Max_tokens_per_doc support for rerank models

@RunLLM it seems that in vLLM, rerank models do not support max_tokens_per_doc but only truncate_prompt_tokens, so vLLM cannot truncate each document individually, only the whole context, right?

Correct. vLLM has no max_tokens_per_doc parameter for rerank models. The available option, truncate_prompt_tokens, truncates the entire input prompt (the whole context), not each document individually, so per-document truncation within a batch is not natively supported. Any per-document truncation logic must be handled client-side by preprocessing the documents before sending them to vLLM.

Would you like more detail or example code for handling per-document truncation outside vLLM?
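A minimal client-side sketch of per-document truncation. The `encode`/`decode` pair below is a toy whitespace tokenizer used only as a stand-in; in practice you would substitute the rerank model's own tokenizer (e.g. loaded via `transformers.AutoTokenizer`) so the token counts match what vLLM sees. The function name `truncate_docs` is illustrative, not part of any vLLM API.

```python
def truncate_docs(docs, encode, decode, max_tokens_per_doc):
    """Truncate each document to at most max_tokens_per_doc tokens
    before sending the batch to a rerank endpoint."""
    truncated = []
    for doc in docs:
        ids = encode(doc)
        if len(ids) > max_tokens_per_doc:
            # Keep only the leading tokens and re-decode to text.
            doc = decode(ids[:max_tokens_per_doc])
        truncated.append(doc)
    return truncated

# Toy whitespace tokenizer as a stand-in for the model's real tokenizer.
encode = lambda s: s.split()
decode = lambda ids: " ".join(ids)

docs = ["short doc", "a much longer document that will be cut down"]
print(truncate_docs(docs, encode, decode, 4))
# → ['short doc', 'a much longer document']
```

With a real subword tokenizer the same pattern applies, though re-decoding truncated token IDs can slightly alter whitespace or split characters at the cut point, which is usually acceptable for reranking.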
