I saw on HF that Qwen3-Embedding and Qwen3-Reranker support an instruction field.
When running them with the OpenAI-compatible server, is it possible to send this instruction somehow in the request body?
If it’s possible only through a code customization, could someone please point me to where in the codebase I should look to implement this?
Currently, the OpenAI-compatible server in vLLM does not natively support passing an instruction field for Qwen3-Embedding or Qwen3-Reranker models directly in the request body. You must manually format the instruction into your prompt or input on the client side before sending the request. This is confirmed in both the documentation and community discussions: “this is currently not handled inside vLLM, so you have to apply the template from client side” and “You can manually format the instruction when calling the API” (source, source).
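For example, a minimal client-side sketch (assuming the OpenAI Python client, a local vLLM server, and placeholder model name and URL; the "Instruct: ...\nQuery: ..." template follows the Qwen3-Embedding model card):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (placeholder URL and key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"

# Bake the instruction into the input text yourself; per the Qwen3-Embedding model card,
# only queries get the instruction prefix, documents are embedded as-is.
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=[f"Instruct: {instruction}\nQuery: {query}"],
)
print(len(resp.data[0].embedding))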
If you want to add native support for an instruction field, you would need to customize the server code. The relevant place to start would be the OpenAI API server implementation, specifically where the request body is parsed and prompts are constructed (e.g., in vllm/entrypoints/openai/api_server.py). You would need to modify the server to accept an instruction parameter and inject it into the prompt formatting logic before passing it to the model (source).
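Very roughly, the change could look like the sketch below. This is illustrative only: the request-model and helper names here (EmbeddingRequest, build_prompts) are hypothetical, and the actual classes and file layout differ between vLLM versions, so verify against your checkout.

from typing import Optional
from pydantic import BaseModel

class EmbeddingRequest(BaseModel):
    # Hypothetical request model; the real one lives in the protocol definitions
    # used by api_server.py and has many more fields.
    model: str
    input: list[str]
    instruction: Optional[str] = None  # new optional field

def build_prompts(request: EmbeddingRequest) -> list[str]:
    # Inject the instruction into every input exactly as a client would,
    # before the texts are handed to the embedding model.
    if request.instruction:
        return [f"Instruct: {request.instruction}\nQuery: {text}" for text in request.input]
    return list(request.input)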
Would you like a step-by-step guide on how to implement this customization?
For the Qwen3-Reranker, the prompt is built from a fixed system prefix, the task instruction, the query, and the document. The instruction is typically "Given a web search query, retrieve relevant passages that answer the query".
Example:
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"
doc = "The capital of China is Beijing."
formatted_query = f"{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
formatted_doc = f"<Document>: {doc}{suffix}"
You should send formatted_query and formatted_doc as your inputs (reference, reference).
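For reference, a hedged example of posting these to the server's score endpoint (the /score route and the text_1/text_2 field names should be verified against your vLLM version's docs; the model name is a placeholder):

import requests

resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "Qwen/Qwen3-Reranker-0.6B",
        "text_1": formatted_query,   # query formatted with prefix + instruction
        "text_2": formatted_doc,     # document formatted with suffix
    },
)
print(resp.json())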
Would you like more detail?
I think this would be a nice addition to the OpenAI-compatible server: the ability to pass additional args to embedding models. It's not rare for embedding models to require a specific instruction; for example, the new EmbeddingGemma has the same requirement, as outlined in its docs. Sure, one can handle it in the application layer, but being able to pass it to the model via the API would be nice.
Currently, the OpenAI-compatible server in vLLM does not support passing additional arguments (such as an instruction field) directly to embedding models via the API. You must manually format the instruction into your input before sending the request (source, source). There is community interest in this feature, and related discussions/feature requests are open, but it is not yet implemented.
Would you like guidance on how to contribute this feature or details on workarounds?