Output truncated without reason

system info

| key                  | value  |
|----------------------|--------|
| transformers version | 4.51.3 |
| PyTorch version      | 2.6.0  |
| vllm version         | 0.8.0  |
| GPU                  | 4090*1 |

Reproduction

I tried running this model, but the output was always cut off partway through and the response reported a finish reason of length, even though I could not find anything wrong with my configuration.

config

python3 -m vllm.entrypoints.openai.api_server --max_model_len 4096 --served-model-name seed-x --model /data/models/Seed-X-PPO-7B

request json

{
        "model": "seed-x",
        "max_token":4096,
        "prompt": "Translate the following English sentence into Chinese:\nThe image captures a charming scene featuring a small grey mouse with a white belly. The mouse is perched on a blue surface, which could be a table or a piece of furniture. It's engaged in the delightful activity of nibbling on a vibrant red strawberry, which appears juicy and ripe. The mouse's attention is fully absorbed by its sweet treat, suggesting it's enjoying its meal. The background is blurred but hints at a room with white walls, possibly indicating an indoor setting. The overall scene exudes a sense of tranquility and simplicity.<zh>"
}

Your output is being truncated because of the max_token parameter (it should be max_tokens) and possibly the model's maximum context length. The correct parameter for vLLM's OpenAI-compatible API is max_tokens, not max_token. Also, the sum of input and output tokens must not exceed the model's max_model_len (4096 in your config). If the prompt is long, the model may stop early with finish_reason: "length" even if you set a high max_tokens value.

To fix, use "max_tokens" in your request, and ensure your prompt plus max_tokens does not exceed 4096 tokens. If you still see truncation, try reducing prompt length or increasing max_model_len if your GPU has enough memory. Would you like a step-by-step example or more troubleshooting tips?
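For reference, a corrected request body could look like the sketch below. The max_tokens value of 2048 is an arbitrary example chosen to leave room for the prompt within the 4096-token max_model_len, and the prompt is abbreviated here; everything else mirrors your original request.

{
        "model": "seed-x",
        "max_tokens": 2048,
        "prompt": "Translate the following English sentence into Chinese:\nThe image captures a charming scene ...<zh>"
}

With the field name corrected, check finish_reason in the response: "stop" means the model finished on its own, while "length" means the prompt plus max_tokens is still hitting the limit, so you would need a shorter prompt, a smaller max_tokens, or a larger max_model_len.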


Thank you very much.