Is it possible to perform an additional matrix multiplication on the embedding outputs (i.e., after the embedding layer) at token positions that are dynamic across samples?
Furthermore, this multiplication must also be applied during the generation stage, i.e., each newly generated token's embedding would have this multiplication applied before going through the attention/MLP layers.
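For concreteness, the operation I mean looks roughly like this in standalone PyTorch (shapes, the transform `W`, and the positions are purely illustrative, not vLLM APIs):

```python
import torch

hidden_size = 4096
W = torch.randn(hidden_size, hidden_size)    # the extra transform I want to apply
embeds = torch.randn(1, 12, hidden_size)     # post-embedding-layer output for one sample
positions = torch.tensor([3, 7, 11])         # positions differ from sample to sample

# Transform only the selected token embeddings; leave the rest untouched.
embeds[0, positions] = embeds[0, positions] @ W
```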
I am aware that prompt embeddings can be passed as inputs, but that only solves the first problem, not the second. Also, would passing embeddings instead of strings be slower?
It seems likely that vLLM does not natively support applying a custom matrix multiplication to the embedding outputs at dynamic token positions during both prompt and generation stages. While vLLM allows passing prompt embeddings directly (see Prompt Embedding Inputs), this only affects the initial prompt and not the embeddings for newly generated tokens during autoregressive generation. There is no documented API or hook for modifying embeddings at each generation step before the attention/MLP layers, nor for dynamically applying a transformation at arbitrary positions per sample during generation, based on the available documentation and issues (Prompt Embedding Inputs, github_issue #20404).
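For the prompt-stage half of the question, the Prompt Embedding Inputs path should let you apply the transform yourself before handing the embeddings to vLLM. A minimal sketch, assuming the `enable_prompt_embeds` engine flag and the `prompt_embeds` prompt field described in that doc (the model name, transform, and positions below are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Build prompt embeddings with the HF embedding layer, then apply the custom
# transform at the per-sample positions before passing them to vLLM.
token_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)  # (seq_len, hidden)

    W = torch.eye(prompt_embeds.shape[-1])        # placeholder transform
    positions = torch.tensor([1, 3])              # dynamic positions for this sample
    prompt_embeds[positions] = prompt_embeds[positions] @ W

llm = LLM(model=model_name, enable_prompt_embeds=True)
outputs = llm.generate({"prompt_embeds": prompt_embeds})
```

This only covers the prompt; embeddings for tokens generated afterwards are still produced internally by vLLM's own embedding layer.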
Regarding performance, my understanding is that passing prompt embeddings (instead of text) can be less efficient, especially for long prompts, because vLLM's optimizations (such as prefix caching and CUDA graph compilation) are designed around token IDs, not arbitrary embeddings (Prompt Embedding Inputs, github_pull_request #24278), so inference with embedding inputs may be slower. Would you like a step-by-step breakdown? As a starting point for possible workarounds, one conceptual approach outside vLLM is sketched below.
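In plain Hugging Face transformers (not vLLM), one way to get both behaviors is to wrap the model's input-embedding module: prefill applies the transform at your chosen positions, and each single-token decode step applies it to the newly generated token. This is only a sketch under those assumptions (placeholder model, identity transform); porting the idea into vLLM would mean modifying the model implementation it loads, which is outside the documented API surface.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class TransformedEmbedding(nn.Module):
    """Wraps an input-embedding layer and applies a matrix W to selected embeddings."""
    def __init__(self, base: nn.Module, W: torch.Tensor):
        super().__init__()
        self.base = base
        self.W = W
        self.prompt_positions = None   # set per sample before prefill

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embeds = self.base(input_ids)
        if input_ids.shape[1] == 1:
            # Decode step with KV cache: the single new token gets the transform.
            return embeds @ self.W.to(embeds.dtype)
        if self.prompt_positions is not None:
            # Prefill: transform only the dynamically chosen prompt positions.
            embeds[:, self.prompt_positions] = embeds[:, self.prompt_positions] @ self.W.to(embeds.dtype)
        return embeds

model_name = "gpt2"                                  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

W = torch.eye(model.config.hidden_size)              # placeholder transform
wrapper = TransformedEmbedding(model.get_input_embeddings(), W)
model.set_input_embeddings(wrapper)

inputs = tok("The capital of France is", return_tensors="pt")
wrapper.prompt_positions = torch.tensor([1, 3])      # dynamic positions for this sample
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```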