Does vLLM support token truncation? give me some examples
Yes, vLLM supports token truncation, but support depends on the API and context. For the OpenAI-compatible completion and embedding APIs, the truncate_prompt_tokens parameter truncates the prompt to its last k tokens (left truncation). In the OpenAI embedding API, for example, truncate_prompt_tokens=10 keeps only the last 10 tokens of the prompt. However, this parameter is not universally supported across vLLM's offline and engine APIs; in some cases (such as the offline LLM engine or AsyncLLM), truncation is not implemented yet, so you must truncate the prompt yourself before passing it to vLLM. Right truncation (keeping the first k tokens) is not natively supported; only left truncation is available via this parameter. See the code example below for OpenAI embedding API usage:
# Example for OpenAI embedding API
response = await client.post(
    path="embeddings",
    cast_to=object,
    body={
        "model": "your-model-name",
        "input": "your input text",
        "truncate_prompt_tokens": 10,
    },
)
For offline usage, you must tokenize and truncate the prompt yourself before passing it to the engine, as truncate_prompt_tokens is not available in the offline API. If you try to use it in unsupported contexts, vLLM will raise an error or ignore the parameter. For more details and limitations, see the linked sources.
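Since left truncation just means keeping the last k token ids, the manual step for offline usage can be sketched as follows (a minimal illustration; truncate_left is a hypothetical helper, not a vLLM API):

```python
def truncate_left(token_ids, k):
    """Keep only the last k token ids (left truncation), mirroring what
    truncate_prompt_tokens does on the server side."""
    if k is None or len(token_ids) <= k:
        return list(token_ids)
    return list(token_ids[-k:])

# Stand-in for real tokenizer output:
ids = list(range(25))
print(truncate_left(ids, 10))  # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
```

You would apply this to the token ids produced by your model's tokenizer before handing the prompt to the engine.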
Would you like more detail or examples for a specific API or use case?
Sources:
- Add option to completion API to truncate prompt tokens
- Does vLLM support right truncation?
- truncate_prompt_tokens in SamplingParams only available for openai entrypoints, not for offline vLLM engine
- Bug: SamplingParams’s truncate_prompt_tokens not working in AsyncLLM Engine
- test_truncation_control.py example
- test_truncation.py example
If I only use LLM.generate or LLM.embed, does it support truncation?
For LLM.generate, token truncation via the truncate_prompt_tokens parameter is supported: set it in SamplingParams to keep only the last k tokens of the prompt (left truncation). For LLM.embed, you can also pass truncate_prompt_tokens to truncate the prompt for embedding models; setting it to -1 truncates to the model's maximum context length. Neither API natively supports right truncation (keeping the first k tokens); only left truncation is available.
Example for LLM.generate:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(truncate_prompt_tokens=10)
outputs = llm.generate("your long prompt here", params)
Example for LLM.embed:
from vllm import LLM
llm = LLM(model="intfloat/e5-small", runner="pooling")
output = llm.embed("your long prompt here", truncate_prompt_tokens=10)
Would you like more detail or clarification for a specific use case?
[rank0]: TypeError: LLM.embed() got an unexpected keyword argument ‘truncate_prompt_tokens’
The error LLM.embed() got an unexpected keyword argument 'truncate_prompt_tokens' means that in your vLLM version, LLM.embed() does not accept the truncate_prompt_tokens argument directly. In such versions, truncation must be handled by tokenizing and truncating the prompt yourself before passing it to LLM.embed(), or via PoolingParams or tokenization_kwargs where supported. This limitation is documented in vLLM issues and PRs: direct support for truncate_prompt_tokens in LLM.embed() was only added in recent releases and may not be available in your installed version.
To resolve this, either upgrade vLLM to a version where the argument is supported, or manually truncate your prompt before calling LLM.embed(). See the linked issue and PR for more context.
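For the manual route, one option is to tokenize, slice off all but the last max_tokens ids, and decode back to text before calling LLM.embed(). A minimal sketch, assuming tokenizer is a Hugging Face-style tokenizer with encode/decode methods (truncate_prompt is a hypothetical helper, not part of vLLM):

```python
def truncate_prompt(prompt: str, tokenizer, max_tokens: int) -> str:
    """Tokenize the prompt, keep its last max_tokens ids (left truncation),
    and decode back to text for APIs lacking truncate_prompt_tokens."""
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    return tokenizer.decode(ids[-max_tokens:])

# With a real model you would then call, for example:
#   llm.embed(truncate_prompt(long_prompt, tokenizer, 512))
```

Note that an encode/decode round trip can occasionally alter whitespace or special characters, so passing token ids directly is preferable when the API accepts them.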
Would you like step-by-step instructions for manual truncation or upgrading vLLM?