I saw on HF that Qwen3-Embedding and Qwen3-Reranker support an instruction field.
When running them with the OpenAI-compatible server, is it possible to send this instruction somehow in the request body?
If it’s possible only through a code customization, could someone please point me to where in the codebase I should look to implement this?
Currently, the OpenAI-compatible server in vLLM does not natively support passing an instruction field for Qwen3-Embedding or Qwen3-Reranker models directly in the request body. You must manually format the instruction into your prompt or input on the client side before sending the request. This is confirmed in both the documentation and community discussions: “this is currently not handled inside vLLM, so you have to apply the template from client side” and “You can manually format the instruction when calling the API” (source, source).
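For example, a minimal client-side sketch (assuming the OpenAI Python client, a local vLLM server, and placeholder model name and URL; the "Instruct: ...\nQuery: ..." template follows the Qwen3-Embedding model card):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (placeholder URL and key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"

# Bake the instruction into the input text yourself; per the Qwen3-Embedding model card,
# only queries get the instruction prefix, documents are embedded as-is.
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=[f"Instruct: {instruction}\nQuery: {query}"],
)
print(len(resp.data[0].embedding))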
If you want to add native support for an instruction field, you would need to customize the server code. The relevant place to start would be the OpenAI API server implementation, specifically where the request body is parsed and prompts are constructed (e.g., in vllm/entrypoints/openai/api_server.py). You would need to modify the server to accept an instruction parameter and inject it into the prompt formatting logic before passing it to the model (source).
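Very roughly, the change could look like the sketch below. This is illustrative only: the request-model and helper names here (EmbeddingRequest, build_prompts) are hypothetical, and the actual classes and file layout differ between vLLM versions, so verify against your checkout.

from typing import Optional
from pydantic import BaseModel

class EmbeddingRequest(BaseModel):
    # Hypothetical request model; the real one lives in the protocol definitions
    # used by api_server.py and has many more fields.
    model: str
    input: list[str]
    instruction: Optional[str] = None  # new optional field

def build_prompts(request: EmbeddingRequest) -> list[str]:
    # Inject the instruction into every input exactly as a client would,
    # before the texts are handed to the embedding model.
    if request.instruction:
        return [f"Instruct: {request.instruction}\nQuery: {text}" for text in request.input]
    return list(request.input)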
Would you like a step-by-step guide on how to implement this customization?
For the Qwen3-Reranker, the prompt is built from a fixed system prefix, the task instruction, the query, and the document. The instruction is typically "Given a web search query, retrieve relevant passages that answer the query".
Example:
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"
doc = "The capital of China is Beijing."
formatted_query = f"{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
formatted_doc = f"<Document>: {doc}{suffix}"
You should send formatted_query and formatted_doc as your inputs (reference, reference).
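For reference, a hedged example of posting these to the server's score endpoint (the /score route and the text_1/text_2 field names should be verified against your vLLM version's docs; the model name is a placeholder):

import requests

resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "Qwen/Qwen3-Reranker-0.6B",
        "text_1": formatted_query,   # query formatted with prefix + instruction
        "text_2": formatted_doc,     # document formatted with suffix
    },
)
print(resp.json())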
Would you like more detail?
I think this would be a nice addition to the OpenAI-compatible server: the ability to pass additional args to embedding models. It's not rare for embedding models to require a specific instruction; for example, the new EmbeddingGemma has the same requirement, as outlined in its docs. Sure, one can handle it in the application layer, but being able to pass it to the model via the API would be nice.
Currently, the OpenAI-compatible server in vLLM does not support passing additional arguments (such as an instruction field) directly to embedding models via the API. You must manually format the instruction into your input before sending the request (source, source). There is community interest in this feature, and related discussions/feature requests are open, but it is not yet implemented.
Would you like guidance on how to contribute this feature or details on workarounds?