Hi, I’m trying to validate support for using embeddings as prompts.
To verify this I’ve been using a small model like facebook/opt-125m, giving it a textual prompt, mapping the prompt to its corresponding embeddings, and checking that, locally, the two prompts generate results that are similar or “equal”.
When I try to use vLLM’s v1/completions API, I get results that don’t correspond, i.e. I must be doing something wrong somewhere. I do get a “valid” response, in the sense that I’m calling the APIs correctly; however, the embedding prompts are not being applied to the model by vLLM the same way they are applied locally when I use the model with the HF libraries directly.
I’m looking for pointers to check things that could be causing this discrepancy.
Prompt embeddings can be used as input in vLLM, but you must ensure that the embeddings are generated and passed in the exact same way as with Hugging Face Transformers. This includes using the same tokenizer, applying the correct chat template (if any), and matching all preprocessing steps. For vLLM, prompt embeddings should be passed as base64-encoded torch tensors via the “prompt_embeds” key in the JSON payload when using the OpenAI-compatible API, as shown in the official example. Discrepancies often arise if the chat template is not applied or if the embeddings are not generated identically to the HF pipeline (see here).
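For reference, here is a minimal client-side sketch of that flow, assuming a server started with --enable-prompt-embeds and listening on localhost:8000. The endpoint path, the torch.save + base64 encoding, and the (seq_len, hidden_size) tensor shape follow the official prompt-embeds example; please double-check them against the example shipped with your vLLM version.

```python
import base64
import io

import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"
# Assumed server address; the server is assumed to have been started with:
#   vllm serve facebook/opt-125m --enable-prompt-embeds
API_URL = "http://localhost:8000/v1/completions"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prompt = "The capital of France is"

# Map the text prompt to embeddings exactly as HF would:
# same tokenizer, same special tokens, same embedding layer.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    # Shape (seq_len, hidden_size); whether the batch dim must be dropped
    # may depend on your vLLM version.
    embeds = model.get_input_embeddings()(input_ids).squeeze(0)

# Serialize the tensor with torch.save and base64-encode it, as in the
# official example.
buf = io.BytesIO()
torch.save(embeds, buf)
prompt_embeds_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

payload = {
    "model": MODEL,
    "prompt_embeds": prompt_embeds_b64,
    "max_tokens": 16,
    "temperature": 0.0,  # greedy decoding makes the two runs comparable
}
# Note: depending on the vLLM version, a (possibly empty) "prompt" field
# may also be required alongside "prompt_embeds".
resp = requests.post(API_URL, json=payload)
print(resp.json()["choices"][0]["text"])
```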
Additionally, ensure that the model is started with --enable-prompt-embeds and that you are not mixing text and embedding prompts in a way that changes the order of results. If you are using a chat model, always apply the chat template before generating embeddings (see docs). If you skip any of these steps, the outputs may differ from direct Hugging Face inference.
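Before involving vLLM at all, it also helps to confirm that the text prompt and the embedding prompt really do produce the same output in plain Transformers. A sketch with greedy decoding, so the two runs are directly comparable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

prompt = "The capital of France is"
enc = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Run 1: normal text prompt.
    out_text = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_new_tokens=16,
        do_sample=False,
    )

    # Run 2: the same prompt passed as embeddings.
    embeds = model.get_input_embeddings()(enc.input_ids)
    out_embeds = model.generate(
        inputs_embeds=embeds,
        attention_mask=enc.attention_mask,
        max_new_tokens=16,
        do_sample=False,
    )

# With inputs_embeds, generate() returns only the newly generated tokens,
# while with input_ids it returns prompt + new tokens, so strip the prompt
# from the first run before comparing.
print(tokenizer.decode(out_text[0][enc.input_ids.shape[1]:], skip_special_tokens=True))
print(tokenizer.decode(out_embeds[0], skip_special_tokens=True))
```

If these two outputs match locally but the vLLM response still differs, the discrepancy is most likely in how the tensor is serialized or preprocessed before being sent to the server.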
Would you like a step-by-step example for your use case?
Thank you for sharing your GitHub issue (#19746). This is the correct place to report and track problems with embedding prompts in vLLM. The maintainers and community can now review your detailed environment and reproduction steps to help diagnose the discrepancy between vLLM and Hugging Face embedding prompt behavior.
If you need further troubleshooting or want to discuss possible causes here, let me know.