Output truncated without reason

system info

| key                  | value  |
|----------------------|--------|
| transformers version | 4.51.3 |
| PyTorch version      | 2.6.0  |
| vllm version         | 0.8.0  |
| GPU                  | 4090*1 |

Reproduction

I tried running this model, but the output was always cut off partway through and the response reported a finish reason of length, even though I could not find anything wrong with my configuration.

config

python3 -m vllm.entrypoints.openai.api_server --max_model_len 4096 --served-model-name seed-x --model /data/models/Seed-X-PPO-7B

request json

{
        "model": "seed-x",
        "max_token":4096,
        "prompt": "Translate the following English sentence into Chinese:\nThe image captures a charming scene featuring a small grey mouse with a white belly. The mouse is perched on a blue surface, which could be a table or a piece of furniture. It's engaged in the delightful activity of nibbling on a vibrant red strawberry, which appears juicy and ripe. The mouse's attention is fully absorbed by its sweet treat, suggesting it's enjoying its meal. The background is blurred but hints at a room with white walls, possibly indicating an indoor setting. The overall scene exudes a sense of tranquility and simplicity.<zh>"
}

Your output is being truncated because of the max_token parameter (it should be max_tokens) and possibly the model's maximum context length. The correct parameter for vLLM's OpenAI-compatible API is max_tokens, not max_token. Also, the sum of input and output tokens must not exceed the model's max_model_len (4096 in your config). If the prompt is long, the model may stop early with finish_reason: "length" even if you set a high max_tokens value.

To fix, use "max_tokens" in your request, and ensure your prompt plus max_tokens does not exceed 4096 tokens. If you still see truncation, try reducing prompt length or increasing max_model_len if your GPU has enough memory. Would you like a step-by-step example or more troubleshooting tips?
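For reference, a corrected request body could look like the sketch below. The max_tokens value of 2048 is an arbitrary example chosen to leave room for the prompt within the 4096-token max_model_len, and the prompt is abbreviated here; everything else mirrors your original request.

{
        "model": "seed-x",
        "max_tokens": 2048,
        "prompt": "Translate the following English sentence into Chinese:\nThe image captures a charming scene ...<zh>"
}

With the field name corrected, check finish_reason in the response: "stop" means the model finished on its own, while "length" means the prompt plus max_tokens is still hitting the limit, so you would need a shorter prompt, a smaller max_tokens, or a larger max_model_len.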


Thank you very much.