Hi, I’m trying to validate support for using embeddings as prompts.
To verify this I’ve been using a small model like facebook/opt-125m, giving it a textual prompt, mapping the prompt to its corresponding embeddings, and checking that, locally, the two prompts generate results that are similar or “equal”.
When I try to use vLLM’s v1/completions API, I get results that don’t correspond, i.e. I must be doing something wrong somewhere. I do get a “valid” response, in the sense that I’m calling the APIs correctly; however, the embedding prompts are not being applied to the model by vLLM the same way they are applied locally when I use the model with the HF libraries directly.
I’m looking for pointers to check things that could be causing this discrepancy.
Prompt embeddings can be used as input in vLLM, but you must ensure that the embeddings are generated and passed in the exact same way as with Hugging Face Transformers. This includes using the same tokenizer, applying the correct chat template (if any), and matching all preprocessing steps. For vLLM, prompt embeddings should be passed as base64-encoded torch tensors via the “prompt_embeds” key in the JSON payload when using the OpenAI-compatible API, as shown in the official example. Discrepancies often arise if the chat template is not applied or if the embeddings are not generated identically to the HF pipeline (see here).
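For reference, here is a minimal client-side sketch of that flow, assuming a server started with --enable-prompt-embeds and listening on localhost:8000. The endpoint path, the torch.save + base64 encoding, and the (seq_len, hidden_size) tensor shape follow the official prompt-embeds example; please double-check them against the example shipped with your vLLM version.

```python
import base64
import io

import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"
# Assumed server address; the server is assumed to have been started with:
#   vllm serve facebook/opt-125m --enable-prompt-embeds
API_URL = "http://localhost:8000/v1/completions"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prompt = "The capital of France is"

# Map the text prompt to embeddings exactly as HF would:
# same tokenizer, same special tokens, same embedding layer.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    # Shape (seq_len, hidden_size); whether the batch dim must be dropped
    # may depend on your vLLM version.
    embeds = model.get_input_embeddings()(input_ids).squeeze(0)

# Serialize the tensor with torch.save and base64-encode it, as in the
# official example.
buf = io.BytesIO()
torch.save(embeds, buf)
prompt_embeds_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

payload = {
    "model": MODEL,
    "prompt_embeds": prompt_embeds_b64,
    "max_tokens": 16,
    "temperature": 0.0,  # greedy decoding makes the two runs comparable
}
# Note: depending on the vLLM version, a (possibly empty) "prompt" field
# may also be required alongside "prompt_embeds".
resp = requests.post(API_URL, json=payload)
print(resp.json()["choices"][0]["text"])
```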
Additionally, ensure that the model is started with --enable-prompt-embeds and that you are not mixing text and embedding prompts in a way that changes the order of results. If you are using a chat model, always apply the chat template before generating embeddings (see docs). If you skip any of these steps, the outputs may differ from direct Hugging Face inference.
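Before involving vLLM at all, it also helps to confirm that the text prompt and the embedding prompt really do produce the same output in plain Transformers. A sketch with greedy decoding, so the two runs are directly comparable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

prompt = "The capital of France is"
enc = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Run 1: normal text prompt.
    out_text = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_new_tokens=16,
        do_sample=False,
    )

    # Run 2: the same prompt passed as embeddings.
    embeds = model.get_input_embeddings()(enc.input_ids)
    out_embeds = model.generate(
        inputs_embeds=embeds,
        attention_mask=enc.attention_mask,
        max_new_tokens=16,
        do_sample=False,
    )

# With inputs_embeds, generate() returns only the newly generated tokens,
# while with input_ids it returns prompt + new tokens, so strip the prompt
# from the first run before comparing.
print(tokenizer.decode(out_text[0][enc.input_ids.shape[1]:], skip_special_tokens=True))
print(tokenizer.decode(out_embeds[0], skip_special_tokens=True))
```

If these two outputs match locally but the vLLM response still differs, the discrepancy is most likely in how the tensor is serialized or preprocessed before being sent to the server.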
Would you like a step-by-step example for your use case?
Thank you for sharing your GitHub issue (#19746). This is the correct place to report and track problems with embedding prompts in vLLM. The maintainers and community can now review your detailed environment and reproduction steps to help diagnose the discrepancy between vLLM and Hugging Face embedding prompt behavior.
If you need further troubleshooting or want to discuss possible causes here, let me know.