Can an input context longer than the maximum context length still work?

Can an input context longer than the maximum context length still work? It always produces this error:

(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] Error in preprocessing prompt inputs
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] Traceback (most recent call last):
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 235, in create_chat_completion
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ) = await self._preprocess_chat(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1133, in _preprocess_chat
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] prompt_inputs = await self._tokenize_prompt_input_async(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 990, in _tokenize_prompt_input_async
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] async for result in self._tokenize_prompt_inputs_async(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1011, in _tokenize_prompt_inputs_async
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] yield await self._normalize_prompt_text_to_input(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 881, in _normalize_prompt_text_to_input
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] return self._validate_input(request, input_ids, input_text)
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 970, in _validate_input
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] raise ValueError(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ValueError: 'max_tokens' or 'max_completion_tokens' is too large: 21333. This model's maximum context length is 196608 tokens and your request has 175808 input tokens (21333 > 196608 - 175808).

vLLM does not allow the total input token count to exceed the model's maximum context length (max_model_len); such requests fail immediately with an error. As the message shows, "max_tokens or max_completion_tokens is too large" means that the input tokens plus the requested generation tokens exceed the model's maximum context window, so you must shorten the input or reduce the generation length before inference can proceed. According to the serving_engine.py source and the official documentation, this is a hard limit that cannot be bypassed.
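
To make the arithmetic concrete, here is a minimal sketch of the check, using the numbers reported in the ValueError above; the variable names are illustrative, not vLLM's internal API:

# Minimal sketch of the limit vLLM enforces, with the values
# taken from the error message in this thread.
max_model_len = 196608   # model's maximum context length
input_tokens  = 175808   # tokens in this request's prompt
max_tokens    = 21333    # requested completion budget

# vLLM requires: input_tokens + max_tokens <= max_model_len
remaining = max_model_len - input_tokens          # 20800 tokens left for generation
if max_tokens > remaining:                        # 21333 > 20800 -> request rejected
    raise ValueError(f"'max_tokens' is too large: {max_tokens} "
                     f"({max_tokens} > {max_model_len} - {input_tokens})")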

If you need to handle over-long inputs, consider truncating the input manually, reducing the number of history turns, or processing the text in chunks; a sketch of client-side truncation follows below. vLLM does not truncate over-long inputs automatically, nor does it shrink the max_tokens parameter for you; a request over the limit can only fail. The related issue discussions also state explicitly that you must ensure yourself that the total of input plus output tokens stays within max_model_len.
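
As one concrete approach, here is a hedged sketch of client-side truncation before calling the OpenAI-compatible server. The model name, server URL, output budget, and the 512-token safety margin are assumptions for illustration, not values from this thread:

# Hedged sketch: count tokens client-side and truncate the prompt so that
# input + max_tokens fits within max_model_len. MODEL_NAME, SERVER_URL,
# MAX_COMPLETION, and the 512-token margin are illustrative assumptions.
from openai import OpenAI
from transformers import AutoTokenizer

MODEL_NAME = "your-org/your-model"        # assumption: the model you serve
SERVER_URL = "http://localhost:8000/v1"   # assumption: your vLLM endpoint
MAX_MODEL_LEN = 196608                    # from the error message above
MAX_COMPLETION = 20000                    # generation budget you actually need

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
client = OpenAI(base_url=SERVER_URL, api_key="EMPTY")

def truncate_to_budget(text: str, budget: int) -> str:
    # Keep only the last `budget` tokens; dropping the oldest text first
    # is one simple policy -- summarizing or chunking are alternatives.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[-budget:])

prompt = open("long_input.txt").read()
# Reserve room for the completion plus a margin for the chat template.
budget = MAX_MODEL_LEN - MAX_COMPLETION - 512
prompt = truncate_to_budget(prompt, budget)

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=MAX_COMPLETION,
)
print(resp.choices[0].message.content)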

Would you like a detailed explanation of the underlying mechanism, or a code example?


Can't vLLM's official OpenAI server handle this itself?