Does the vLLM embedding model support inputting input_ids
Yes, vLLM embedding models work on input_ids. The standard interface expects text input, which is tokenized into input_ids (token IDs) and then converted to embeddings internally. Passing embeddings directly (inputs_embeds) is not generally supported for text-only models, so input_ids are the representation used for embedding tasks in vLLM.
Would you like more detail or code examples?
please give me some examples
Taking a vLLM embedding model as an example, the standard usage is to pass a text string to the LLM.embed method; vLLM converts it to input_ids (token IDs) internally and returns the embedding vector. For example:
from vllm import LLM
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
With this approach you do not need to pass input_ids manually; vLLM handles the conversion from string to token IDs automatically, and this works for all models that support embedding.
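LLM.embed also accepts a list of strings, so several texts can be embedded in one call; a small sketch reusing the llm object from the example above:
texts = ["The weather is lovely today.", "vLLM makes embedding easy."]
outputs = llm.embed(texts)  # one output per input string, in the same order
for text, output in zip(texts, outputs):
    print(text, "->", len(output.outputs.embedding))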
Would you like a more detailed explanation or other usage examples?
In this context, “input_ids” refers to token IDs.
Yes, in this context, “input_ids” refers to token IDs—numerical representations of tokens produced by a tokenizer from input text. These IDs are the standard input for vLLM embedding models, which then convert them to embeddings internally (Pooling Models — vLLM).
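For a concrete look at what those token IDs are, you can run a tokenizer yourself; a small sketch using Hugging Face transformers (the model name is just an example):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
ids = tokenizer("Hello, my name is")["input_ids"]
print(ids)  # a plain Python list of integer token IDs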
Would you like more detail?
Does the vLLM embedding model support inputting token_ids? Please give me some examples.
Yes, the vLLM embedding model supports inputting input_ids (token IDs). You can provide text strings as input, and vLLM will tokenize them into input_ids internally. Here’s a simple example using the LLM.embed API:
from vllm import LLM
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("A small, charming house with a red roof sits on a tiny island surrounded by calm waters.")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
This code takes your text, converts it to token IDs (input_ids), and returns the embedding vector. You do not need to manually provide token IDs; vLLM handles the conversion from text to input_ids automatically (Pooling Models — vLLM).
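If you already have pre-tokenized IDs, recent vLLM versions can also take them wrapped in a TokensPrompt instead of a raw string; a sketch, assuming your version exposes vllm.inputs.TokensPrompt and that LLM.embed accepts it:
from vllm.inputs import TokensPrompt
# Example IDs only; use IDs produced by the model's own tokenizer.
(output,) = llm.embed(TokensPrompt(prompt_token_ids=[1, 2, 3]))
print(len(output.outputs.embedding))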
Would you like more detail or examples for other input types?
Does the vLLM embedding model support inputting {'input_ids': tensor([[ 3, 6, 124084]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1]], device='cuda:0')}
No, the vLLM embedding model does not support directly inputting a dictionary like {'input_ids': tensor(…), 'attention_mask': tensor(…)} as input. vLLM expects text strings or lists of strings as input, and it handles tokenization internally. Directly passing PyTorch tensors or such dictionaries is not supported in the public API (Pooling Models — vLLM).
Would you like more detail or guidance on supported input formats?
What if I call a custom forward class in vLLM instead of calling the existing methods?
vLLM's pooling models (e.g. embedding, reward, classification) do not call your custom forward method directly by default; instead, the model implements the VllmModelForPooling interface, and vLLM calls the pooler method to extract the output from the hidden states. If you want vLLM to run your custom forward logic, your model class must implement the interface vLLM requires (such as VllmModelForPooling) and invoke your custom forward logic inside those interface methods (see the official documentation).
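A very rough sketch of that structure (the class below and its attributes are hypothetical illustrations, not vLLM's actual base classes; the real interface and pooler types come from vLLM itself):
import torch
from torch import nn

class MyCustomPoolingModel(nn.Module):
    # Hypothetical sketch: vLLM drives forward() to produce hidden states,
    # then calls pooler() to turn those hidden states into the pooled output.
    def __init__(self, backbone: nn.Module, pooler_fn):
        super().__init__()
        self.backbone = backbone
        self.pooler_fn = pooler_fn

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor, **kwargs):
        # Your custom forward logic lives here and must return hidden states.
        return self.backbone(input_ids, positions, **kwargs)

    def pooler(self, hidden_states, pooling_metadata):
        # Pooling entry point expected by the interface; delegate to your own pooling.
        return self.pooler_fn(hidden_states, pooling_metadata)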
Let me know if you need a detailed implementation or code examples.
I want to call forward directly via llm.llm_engine.model_executor.driver_worker.model_runner.model; please give me some examples.
Under the vLLM v1 architecture, you cannot access the model instance directly through llm.llm_engine.model_executor.driver_worker.model_runner.model, because the model runs in separate worker processes. The officially recommended approach is to use the collective_rpc method to execute a custom function inside the worker processes. For example:
from vllm import LLM

def echo_model_runner(self):
    # Runs inside each worker; `self` is the worker instance.
    return self.model_runner.model.__class__

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
ret = llm.collective_rpc(echo_model_runner)
print(ret[0])  # prints the model class
collective_rpc runs your custom function on every worker. If you need a forward pass, you can call self.model_runner.model.forward(…) inside the function body and return the result.
Would you like a more detailed forward example?
Please give a more detailed example implementation; I am not using the v1 architecture here.
If you are not on the vLLM v1 architecture (i.e., a v0.x version), you can access the model instance directly via llm.llm_engine.model_executor.driver_worker.model_runner.model and call its forward method. For example, with LlamaForCausalLM:
from vllm import LLM
import torch

llm = LLM(model="meta-llama/Llama-2-7b-hf")
# Get the model instance
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
# Build the inputs
input_ids = torch.tensor([[1, 2, 3]]).cuda()
positions = torch.arange(0, input_ids.shape[1], dtype=torch.long).unsqueeze(0).cuda()
# Call forward
with torch.no_grad():
    output = model.forward(input_ids=input_ids, positions=positions)
print(output)
Note: input_ids and positions must be on the same GPU, and the argument names must match the model's forward method.
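A quick way to confirm the exact parameter names your model's forward expects (they vary between vLLM versions) is to print the signature:
import inspect
print(inspect.signature(model.forward))  # lists the parameters forward actually accepts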
Do you need a more detailed explanation of the forward parameters, or examples for other models?
I am using an embedding model (BGE-M3) and want to call it this way; please give me some examples.
You can get the BGE-M3 embedding model instance directly via llm.llm_engine.model_executor.driver_worker.model_runner.model and call its forward method. Using the vLLM v0 architecture as an example:
from vllm import LLM
import torch

llm = LLM(model="BAAI/bge-m3", task="embed")
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
# Build input_ids and attention_mask
input_ids = torch.tensor([[3, 6, 124084]]).cuda()
attention_mask = torch.tensor([[1, 1, 1]]).cuda()
# The forward arguments must match the model's definition
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs)
Note: BGE-M3 is an encoder-only architecture, so its forward method typically takes input_ids and attention_mask. The tensors must be on the same GPU.
Would you like more detail on the output, or on how to extract the embedding?
AssertionError: Forward context is not set. Please use set_forward_context to set the forward context. What causes this error?
This error means you called the model's forward method directly without setting vLLM's forward context. vLLM relies on the set_forward_context context manager to manage global inference state; when it is missing, this assertion fails (see the official documentation).
If you need to call forward directly, wrap your call in set_forward_context, for example:
from vllm.forward_context import set_forward_context

with set_forward_context(...):  # the correct attn_metadata and vllm_config must be passed in
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
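As a rough, unverified sketch of what filling in those arguments might look like (the llm.llm_engine.vllm_config attribute and passing attn_metadata=None are assumptions that depend on your vLLM version and attention backend):
import torch
from vllm.forward_context import set_forward_context

# Assumption: the engine exposes its VllmConfig here; adjust for your version.
vllm_config = llm.llm_engine.vllm_config
with torch.no_grad(), set_forward_context(None, vllm_config):  # None = no attention metadata
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs)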
The exact arguments depend on your model and configuration. Would you like a detailed explanation of the parameters or a complete example?