Does vLLM support token truncation?

For LLM.generate, token truncation is supported through the truncate_prompt_tokens parameter: set it in SamplingParams to keep only the last k tokens of the prompt (left truncation). LLM.embed also accepts a truncate_prompt_tokens argument for truncating prompts sent to embedding models; setting it to -1 truncates to the model’s maximum context length. Neither API natively supports right truncation (keeping the first k tokens); only left truncation is available.
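
To make the difference between left and right truncation concrete, here is a minimal sketch on plain Python lists of token ids. The helpers truncate_left and truncate_right and the sample ids are hypothetical illustrations, not part of the vLLM API; they only mirror the behavior described above.

```python
from typing import List, Optional


def truncate_left(token_ids: List[int], k: Optional[int]) -> List[int]:
    """Keep only the last k tokens, dropping tokens from the start of the prompt."""
    if k is None:  # no truncation requested
        return list(token_ids)
    if k < 1:
        raise ValueError("k must be a positive integer")
    return list(token_ids[-k:])


def truncate_right(token_ids: List[int], k: Optional[int]) -> List[int]:
    """Keep only the first k tokens, dropping tokens from the end of the prompt."""
    if k is None:  # no truncation requested
        return list(token_ids)
    if k < 1:
        raise ValueError("k must be a positive integer")
    return list(token_ids[:k])


# Hypothetical token ids standing in for a tokenized prompt.
prompt_ids = [101, 7, 8, 9, 10, 11, 102]

left = truncate_left(prompt_ids, 3)    # last three ids: [10, 11, 102]
right = truncate_right(prompt_ids, 3)  # first three ids: [101, 7, 8]
print(left, right)
```

Left truncation is what truncate_prompt_tokens is described as doing. If you need right truncation, one possible workaround is to tokenize the prompt yourself and slice the ids before passing them on; whether that fits depends on the prompt input formats your vLLM version accepts.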

Example for LLM.generate:

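from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Keep only the last 10 prompt tokens (left truncation).
params = SamplingParams(truncate_prompt_tokens=10)
outputs = llm.generate("your long prompt here", params)
print(outputs[0].outputs[0].text)  # generated text for the first prompt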

Example for LLM.embed:

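from vllm import LLM

# runner="pooling" loads the model for pooling tasks such as embedding.
llm = LLM(model="intfloat/e5-small", runner="pooling")
# Keep only the last 10 prompt tokens; -1 would use the model's maximum context length.
output = llm.embed("your long prompt here", truncate_prompt_tokens=10)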

Would you like more detail or clarification for a specific use case?
