Does vLLM support token truncation?

RunLLM · September 10, 2025, 8:27am

For LLM.generate, token truncation is supported through the truncate_prompt_tokens parameter: set it in SamplingParams to keep only the last k tokens of the prompt (left truncation). LLM.embed also accepts a truncate_prompt_tokens argument for embedding models; setting it to -1 truncates to the model’s maximum context length. Neither API natively supports right truncation (keeping the first k tokens); only left truncation is available.
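To make the left/right distinction concrete, here is a minimal pure-Python sketch that works on a plain list of token IDs. The helper truncate_tokens is hypothetical and not part of vLLM; it only illustrates which tokens survive under each policy.

```python
def truncate_tokens(token_ids, k, side="left"):
    """Illustrative helper (not a vLLM API): keep at most k tokens.

    side="left"  drops tokens from the left and keeps the last k,
                 matching the behavior described for truncate_prompt_tokens.
    side="right" drops tokens from the right and keeps the first k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(token_ids) <= k:
        # Prompt already fits; nothing to truncate.
        return list(token_ids)
    if side == "left":
        return list(token_ids[-k:])
    return list(token_ids[:k])


prompt_ids = list(range(1, 9))  # stand-in for 8 token IDs: 1..8
kept_left = truncate_tokens(prompt_ids, 3, side="left")
kept_right = truncate_tokens(prompt_ids, 3, side="right")
print(kept_left)   # [6, 7, 8]
print(kept_right)  # [1, 2, 3]
```

With left truncation the end of the prompt is preserved, which usually matters for generation because the most recent context sits there.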

Example for LLM.generate:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Keep only the last 10 prompt tokens (left truncation).
params = SamplingParams(truncate_prompt_tokens=10)
outputs = llm.generate("your long prompt here", params)
print(outputs[0].outputs[0].text)

Example for LLM.embed:

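from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
# Truncate the prompt to 10 tokens before embedding; -1 would use the model's maximum context length.
output = llm.embed("your long prompt here", truncate_prompt_tokens=10)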
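If the first k tokens are needed instead, one possible workaround is to tokenize the prompt yourself, slice the token IDs, and pass them to vLLM as a pre-tokenized prompt. The sketch below is only an assumption about how this could be wired up: keep_first_k and generate_right_truncated are hypothetical helpers, and the calls llm.get_tokenizer(), tokenizer.encode(), and the {"prompt_token_ids": ...} prompt form should be checked against the installed vLLM version.

```python
def keep_first_k(token_ids, k):
    """Right truncation: keep only the first k token IDs (hypothetical helper)."""
    return list(token_ids[:k])


def generate_right_truncated(llm, prompt, k, sampling_params):
    """Sketch of a right-truncation workaround.

    Assumes `llm` behaves like vllm.LLM: it exposes get_tokenizer() and
    accepts a dict with "prompt_token_ids" as a pre-tokenized prompt.
    """
    tokenizer = llm.get_tokenizer()
    # encode() may prepend special tokens (e.g. BOS), which count toward k.
    token_ids = tokenizer.encode(prompt)
    truncated = keep_first_k(token_ids, k)
    return llm.generate({"prompt_token_ids": truncated}, sampling_params)
```

A call would look like generate_right_truncated(llm, "your long prompt here", 10, SamplingParams()), with truncate_prompt_tokens left unset since the prompt is already shortened.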

Would you like more detail or clarification for a specific use case?
