Please give me a detailed code, thanks!
Certainly! Here is a detailed example of offline inference with vLLM using the `LLM` and `SamplingParams` classes. It generates completions for a list of prompts using a small supported model (e.g., OPT-125M); you can adapt the model name and prompts as needed.
```python
from vllm import LLM, SamplingParams

# Define your prompts
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM engine with your model
llm = LLM(model="facebook/opt-125m")

# Generate outputs (batched offline inference)
outputs = llm.generate(prompts, sampling_params)

# Print the prompt and generated text for each output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
This code runs offline batched inference and prints the generated text for each prompt. You can swap in any supported model and adjust the sampling parameters (e.g., `temperature`, `top_p`, `max_tokens`) as needed.
Would you like more detail on customizing prompts, using chat templates, or running vLLM as an OpenAI-compatible server?