Please give me a detailed code, thanks!
Certainly! Here is a detailed example of offline inference with vLLM using the `LLM` and `SamplingParams` classes. It generates completions for a list of prompts using a small supported model (e.g., OPT-125M); you can adapt the model name and prompts as needed.
```python
from vllm import LLM, SamplingParams

# Define your prompts
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM engine with your model
llm = LLM(model="facebook/opt-125m")

# Generate outputs (batched offline inference)
outputs = llm.generate(prompts, sampling_params)

# Print the prompt and generated text for each output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
This code runs offline batched inference and prints the generated text for each prompt. You can swap in any supported model and adjust the sampling parameters (e.g., `temperature`, `top_p`, `max_tokens`) as needed.
Would you like more detail on customizing prompts, using chat templates, or running vLLM as an OpenAI-compatible server?