Is it possible to perform an additional matrix multiplication on the embedding outputs (i.e., after the embedding layer) at token positions that are dynamic across samples?
Furthermore, this multiplication must also be applied during the generation stage, i.e., each newly generated token's embedding would have this multiplication applied before going through the attention/MLP layers.
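For concreteness, the operation I mean looks roughly like this in standalone PyTorch (shapes, the transform `W`, and the positions are purely illustrative, not vLLM APIs):

```python
import torch

hidden_size = 4096
W = torch.randn(hidden_size, hidden_size)    # the extra transform I want to apply
embeds = torch.randn(1, 12, hidden_size)     # post-embedding-layer output for one sample
positions = torch.tensor([3, 7, 11])         # positions differ from sample to sample

# Transform only the selected token embeddings; leave the rest untouched.
embeds[0, positions] = embeds[0, positions] @ W
```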
I am aware that prompt embeddings can be passed as inputs, but that only solves the first problem, not the second. Also, would passing embeddings instead of strings be slower?
It seems likely that vLLM does not natively support applying a custom matrix multiplication to the embedding outputs at dynamic token positions during both prompt and generation stages. While vLLM allows passing prompt embeddings directly (see Prompt Embedding Inputs), this only affects the initial prompt and not the embeddings for newly generated tokens during autoregressive generation. There is no documented API or hook for modifying embeddings at each generation step before the attention/MLP layers, nor for dynamically applying a transformation at arbitrary positions per sample during generation, based on the available documentation and issues (Prompt Embedding Inputs, github_issue #20404).
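For the prompt-stage half of the question, the Prompt Embedding Inputs path should let you apply the transform yourself before handing the embeddings to vLLM. A minimal sketch, assuming the `enable_prompt_embeds` engine flag and the `prompt_embeds` prompt field described in that doc (the model name, transform, and positions below are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Build prompt embeddings with the HF embedding layer, then apply the custom
# transform at the per-sample positions before passing them to vLLM.
token_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)  # (seq_len, hidden)

    W = torch.eye(prompt_embeds.shape[-1])        # placeholder transform
    positions = torch.tensor([1, 3])              # dynamic positions for this sample
    prompt_embeds[positions] = prompt_embeds[positions] @ W

llm = LLM(model=model_name, enable_prompt_embeds=True)
outputs = llm.generate({"prompt_embeds": prompt_embeds})
```

This only covers the prompt; embeddings for tokens generated afterwards are still produced internally by vLLM's own embedding layer.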
Regarding performance, my understanding is that passing prompt embeddings (instead of text) can be less efficient, especially for long prompts, because vLLM's optimizations (such as prefix caching and CUDA graph compilation) are designed around token IDs, not arbitrary embeddings (Prompt Embedding Inputs, github_pull_request #24278), so inference with embedding inputs may be slower. Would you like a step-by-step breakdown? As a starting point for possible workarounds, one conceptual approach outside vLLM is sketched below.
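In plain Hugging Face transformers (not vLLM), one way to get both behaviors is to wrap the model's input-embedding module: prefill applies the transform at your chosen positions, and each single-token decode step applies it to the newly generated token. This is only a sketch under those assumptions (placeholder model, identity transform); porting the idea into vLLM would mean modifying the model implementation it loads, which is outside the documented API surface.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class TransformedEmbedding(nn.Module):
    """Wraps an input-embedding layer and applies a matrix W to selected embeddings."""
    def __init__(self, base: nn.Module, W: torch.Tensor):
        super().__init__()
        self.base = base
        self.W = W
        self.prompt_positions = None   # set per sample before prefill

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embeds = self.base(input_ids)
        if input_ids.shape[1] == 1:
            # Decode step with KV cache: the single new token gets the transform.
            return embeds @ self.W.to(embeds.dtype)
        if self.prompt_positions is not None:
            # Prefill: transform only the dynamically chosen prompt positions.
            embeds[:, self.prompt_positions] = embeds[:, self.prompt_positions] @ self.W.to(embeds.dtype)
        return embeds

model_name = "gpt2"                                  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

W = torch.eye(model.config.hidden_size)              # placeholder transform
wrapper = TransformedEmbedding(model.get_input_embeddings(), W)
model.set_input_embeddings(wrapper)

inputs = tok("The capital of France is", return_tensors="pt")
wrapper.prompt_positions = torch.tensor([1, 3])      # dynamic positions for this sample
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```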