Output truncated without reason

Your output is being truncated because you are passing `max_token` instead of the correct parameter name, `max_tokens`, and possibly because of the model's maximum context length. vLLM's OpenAI-compatible API expects `max_tokens`. Also, the sum of input and output tokens must not exceed the model's `max_model_len` (4096 in your config), so if the prompt is long, the model will stop early with `finish_reason: "length"` even if you set a high `max_tokens` value.

To fix this, use `max_tokens` in your request and make sure your prompt plus `max_tokens` stays within 4096 tokens. If you still see truncation, reduce the prompt length, or increase `max_model_len` if your GPU has enough memory. Would you like a step-by-step example or more troubleshooting tips?
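As a rough illustration, here is a minimal sketch of building a request body with the correct `max_tokens` key and a token-budget check against `max_model_len`. The model name and prompt are placeholders, and the exact token counting is up to your tokenizer; this only shows the budget arithmetic, not a full client.

```python
def build_completion_request(prompt_tokens: int, max_tokens: int,
                             max_model_len: int = 4096) -> dict:
    """Return a request body for an OpenAI-compatible /v1/completions call,
    raising early if the token budget would exceed the model's context."""
    if prompt_tokens + max_tokens > max_model_len:
        raise ValueError(
            f"prompt ({prompt_tokens}) + max_tokens ({max_tokens}) "
            f"exceeds max_model_len ({max_model_len})"
        )
    return {
        "model": "my-model",       # placeholder model name
        "prompt": "...",           # your actual prompt text goes here
        "max_tokens": max_tokens,  # correct key: max_tokens, not max_token
    }
```

You would POST this body to your vLLM server's `/v1/completions` endpoint; the check mirrors the constraint above, so a request that would trigger `finish_reason: "length"` due to an oversized budget fails fast on the client side instead.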
