Output truncated without reason

Your output is being truncated because you are passing `max_token` instead of the correct parameter name, `max_tokens`, and possibly because of the model's maximum context length. vLLM's OpenAI-compatible API expects `max_tokens`. Also, the sum of input and output tokens must not exceed the model's `max_model_len` (4096 in your config), so if the prompt is long, the model will stop early with `finish_reason: "length"` even if you set a high `max_tokens` value.

To fix this, use `max_tokens` in your request and make sure your prompt plus `max_tokens` stays within 4096 tokens. If you still see truncation, reduce the prompt length, or increase `max_model_len` if your GPU has enough memory. Would you like a step-by-step example or more troubleshooting tips?
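As a rough illustration, here is a minimal sketch of building a request body with the correct `max_tokens` key and a token-budget check against `max_model_len`. The model name and prompt are placeholders, and the exact token counting is up to your tokenizer; this only shows the budget arithmetic, not a full client.

```python
def build_completion_request(prompt_tokens: int, max_tokens: int,
                             max_model_len: int = 4096) -> dict:
    """Return a request body for an OpenAI-compatible /v1/completions call,
    raising early if the token budget would exceed the model's context."""
    if prompt_tokens + max_tokens > max_model_len:
        raise ValueError(
            f"prompt ({prompt_tokens}) + max_tokens ({max_tokens}) "
            f"exceeds max_model_len ({max_model_len})"
        )
    return {
        "model": "my-model",       # placeholder model name
        "prompt": "...",           # your actual prompt text goes here
        "max_tokens": max_tokens,  # correct key: max_tokens, not max_token
    }
```

You would POST this body to your vLLM server's `/v1/completions` endpoint; the check mirrors the constraint above, so a request that would trigger `finish_reason: "length"` due to an oversized budget fails fast on the client side instead.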
