Does vLLM support token truncation? give me some examples
Yes, vLLM supports token truncation, but support depends on the API and context. For the OpenAI-compatible completion and embedding APIs, the truncate_prompt_tokens parameter truncates the prompt to its last k tokens (left truncation). In the OpenAI embedding API, for example, truncate_prompt_tokens=10 keeps only the last 10 tokens of the prompt. However, this parameter is not universally supported across vLLM's offline and engine APIs; in some cases (such as the offline LLM engine or AsyncLLM), truncation is not implemented yet, so you must truncate the prompt yourself before passing it to vLLM. Right truncation (keeping the first k tokens) is not natively supported; only left truncation is available via this parameter. See the code example below for OpenAI embedding API usage:
# Example for OpenAI embedding API
response = await client.post(
    path="embeddings",
    cast_to=object,
    body={
        "model": "your-model-name",
        "input": "your input text",
        "truncate_prompt_tokens": 10,
    },
)
For offline usage, you must tokenize and truncate the prompt yourself before passing it to the engine, as truncate_prompt_tokens is not available in the offline API. If you try to use it in unsupported contexts, vLLM will raise an error or ignore the parameter. For more details and limitations, see the linked sources.
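Since left truncation just means keeping the last k token ids, the manual step for offline usage can be sketched as follows (a minimal illustration; truncate_left is a hypothetical helper, not a vLLM API):

```python
def truncate_left(token_ids, k):
    """Keep only the last k token ids (left truncation), mirroring what
    truncate_prompt_tokens does on the server side."""
    if k is None or len(token_ids) <= k:
        return list(token_ids)
    return list(token_ids[-k:])

# Stand-in for real tokenizer output:
ids = list(range(25))
print(truncate_left(ids, 10))  # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
```

You would apply this to the token ids produced by your model's tokenizer before handing the prompt to the engine.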
Would you like more detail or examples for a specific API or use case?
Sources:
- Add option to completion API to truncate prompt tokens
- Does vLLM support right truncation?
- truncate_prompt_tokens in SamplingParams only available for openai entrypoints, not for offline vLLM engine
- Bug: SamplingParams’s truncate_prompt_tokens not working in AsyncLLM Engine
- test_truncation_control.py example
- test_truncation.py example
If I only use LLM.generate or LLM.embed, does it support truncation?
For LLM.generate, token truncation via the truncate_prompt_tokens parameter is supported: set it in SamplingParams to keep only the last k tokens of the prompt (left truncation). For LLM.embed, you can also pass truncate_prompt_tokens to truncate the prompt for embedding models; setting it to -1 truncates to the model's maximum context length. Neither API natively supports right truncation (keeping the first k tokens); only left truncation is available.
Example for LLM.generate:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(truncate_prompt_tokens=10)
outputs = llm.generate("your long prompt here", params)
Example for LLM.embed:
from vllm import LLM
llm = LLM(model="intfloat/e5-small", runner="pooling")
output = llm.embed("your long prompt here", truncate_prompt_tokens=10)
Would you like more detail or clarification for a specific use case?
[rank0]: TypeError: LLM.embed() got an unexpected keyword argument ‘truncate_prompt_tokens’
The error LLM.embed() got an unexpected keyword argument 'truncate_prompt_tokens' means that in your vLLM version, LLM.embed() does not accept the truncate_prompt_tokens argument directly. In such versions, truncation must be handled by tokenizing and truncating the prompt yourself before passing it to LLM.embed(), or via PoolingParams or tokenization_kwargs where supported. This limitation is documented in vLLM issues and PRs: direct support for truncate_prompt_tokens in LLM.embed() was only added in recent releases and may not be available in your installed version.
To resolve this, either upgrade vLLM to a version where the argument is supported, or manually truncate your prompt before calling LLM.embed(). See the linked issue and PR for more context.
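For the manual route, one option is to tokenize, slice off all but the last max_tokens ids, and decode back to text before calling LLM.embed(). A minimal sketch, assuming tokenizer is a Hugging Face-style tokenizer with encode/decode methods (truncate_prompt is a hypothetical helper, not part of vLLM):

```python
def truncate_prompt(prompt: str, tokenizer, max_tokens: int) -> str:
    """Tokenize the prompt, keep its last max_tokens ids (left truncation),
    and decode back to text for APIs lacking truncate_prompt_tokens."""
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    return tokenizer.decode(ids[-max_tokens:])

# With a real model you would then call, for example:
#   llm.embed(truncate_prompt(long_prompt, tokenizer, 512))
```

Note that an encode/decode round trip can occasionally alter whitespace or special characters, so passing token ids directly is preferable when the API accepts them.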
Would you like step-by-step instructions for manual truncation or upgrading vLLM?