Yes, vLLM supports prompt token truncation, but the support depends on which API you use. In the OpenAI-compatible completion and embedding APIs, the truncate_prompt_tokens parameter truncates the prompt to its last k tokens (left truncation); for example, truncate_prompt_tokens=10 in an embedding request uses only the last 10 tokens of the prompt. The parameter is not honored by every offline or engine API: for the offline LLM engine and AsyncLLM, the linked reports indicate that truncation is not implemented there yet, so you may need to truncate the prompt yourself before passing it to vLLM. Right truncation (keeping the first k tokens) is not natively supported; only left truncation is available through this parameter. The following example shows usage with the OpenAI-compatible embedding API:
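# Example for the OpenAI-compatible embedding API
import asyncio
from openai import AsyncOpenAI

async def main():
    # Assumption: a vLLM server is listening locally on port 8000
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = await client.post(
        path="embeddings",
        cast_to=object,
        body={
            "model": "your-model-name",
            "input": "your input text",
            "truncate_prompt_tokens": 10,  # keep only the last 10 prompt tokens
        },
    )
    print(response)

asyncio.run(main())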
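The completion API accepts the same field in its request body. As a rough sketch, the payload for a POST to /v1/completions could be assembled as below; the helper name build_completion_body and the max_tokens value are illustrative assumptions, not part of vLLM.

```python
import json


def build_completion_body(model: str, prompt: str, keep_last: int) -> dict:
    """Build a JSON body for /v1/completions that asks for left truncation.

    Hypothetical helper for illustration; only the field names come from the API.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be a positive integer")
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 32,  # arbitrary value chosen for the example
        "truncate_prompt_tokens": keep_last,  # keep only the last `keep_last` tokens
    }


body = build_completion_body("your-model-name", "your input text", keep_last=10)
print(json.dumps(body, indent=2))
```

If you call the server through the official openai Python client instead of posting raw JSON, non-standard fields such as truncate_prompt_tokens are normally passed through the extra_body argument of client.completions.create.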
For offline usage, tokenize and truncate the prompt yourself before passing it to the engine, because truncate_prompt_tokens is not applied on the offline path. In unsupported contexts, vLLM either raises an error or ignores the parameter, so do not assume the prompt was shortened. For more details and limitations, see the linked sources.
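Manual truncation amounts to slicing the list of token IDs before handing it to the engine. The sketch below shows both directions on plain integer lists; the helper names keep_last_tokens and keep_first_tokens are made up for this example, and the token IDs are dummy values rather than real tokenizer output.

```python
from typing import List


def keep_last_tokens(token_ids: List[int], k: int) -> List[int]:
    """Left truncation: drop tokens from the start, keep the last k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return token_ids[-k:]


def keep_first_tokens(token_ids: List[int], k: int) -> List[int]:
    """Right truncation: drop tokens from the end, keep the first k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return token_ids[:k]


# Dummy token IDs standing in for tokenizer output
ids = list(range(100, 112))
left = keep_last_tokens(ids, 5)
right = keep_first_tokens(ids, 5)
print(left)   # [107, 108, 109, 110, 111]
print(right)  # [100, 101, 102, 103, 104]

# Possible use with the offline engine (not executed here; assumes an
# existing vllm.LLM instance `llm` and a SamplingParams object `params`):
#   tokenizer = llm.get_tokenizer()
#   ids = tokenizer.encode(prompt)
#   outputs = llm.generate({"prompt_token_ids": keep_first_tokens(ids, 512)}, params)
```

Note that left truncation also drops any special token the tokenizer placed at the start of the prompt (such as a BOS token), which may matter for some models.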
Would you like more detail or examples for a specific API or use case?
Sources:
- Add option to completion API to truncate prompt tokens
- Does vLLM support right truncation?
- truncate_prompt_tokens in SamplingParams only available for openai entrypoints, not for offline vLLM engine
- Bug: SamplingParams’s truncate_prompt_tokens not working in AsyncLLM Engine
- test_truncation_control.py example
- test_truncation.py example