Can an input context longer than the maximum context length still work?

Can an input context longer than the maximum context length still work? It always produces this error:

(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] Error in preprocessing prompt inputs
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] Traceback (most recent call last):
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 235, in create_chat_completion
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ) = await self._preprocess_chat(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1133, in _preprocess_chat
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] prompt_inputs = await self._tokenize_prompt_input_async(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 990, in _tokenize_prompt_input_async
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] async for result in self._tokenize_prompt_inputs_async(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1011, in _tokenize_prompt_inputs_async
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] yield await self._normalize_prompt_text_to_input(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 881, in _normalize_prompt_text_to_input
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] return self._validate_input(request, input_ids, input_text)
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 970, in _validate_input
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] raise ValueError(
(APIServer pid=1) ERROR 12-03 00:16:55 [serving_chat.py:257] ValueError: 'max_tokens' or 'max_completion_tokens' is too large: 21333. This model's maximum context length is 196608 tokens and your request has 175808 input tokens (21333 > 196608 - 175808).

vLLM does not allow the total input token count to exceed the model's maximum context length (max_model_len); such requests fail immediately with an error. As the message shows, "max_tokens or max_completion_tokens is too large" means that the input tokens plus the requested generation tokens exceed the model's maximum context window, so you must shorten the input or reduce the generation length before inference can proceed. According to the serving_engine.py source and the official documentation, this is a hard limit that cannot be bypassed.
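
To make the arithmetic concrete, here is a minimal sketch of the check, using the numbers reported in the ValueError above; the variable names are illustrative, not vLLM's internal API:

# Minimal sketch of the limit vLLM enforces, with the values
# taken from the error message in this thread.
max_model_len = 196608   # model's maximum context length
input_tokens  = 175808   # tokens in this request's prompt
max_tokens    = 21333    # requested completion budget

# vLLM requires: input_tokens + max_tokens <= max_model_len
remaining = max_model_len - input_tokens          # 20800 tokens left for generation
if max_tokens > remaining:                        # 21333 > 20800 -> request rejected
    raise ValueError(f"'max_tokens' is too large: {max_tokens} "
                     f"({max_tokens} > {max_model_len} - {input_tokens})")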

If you need to handle over-long inputs, consider truncating the input manually, reducing the number of history turns, or processing the text in chunks; a sketch of client-side truncation follows below. vLLM does not truncate over-long inputs automatically, nor does it shrink the max_tokens parameter for you; a request over the limit can only fail. The related issue discussions also state explicitly that you must ensure yourself that the total of input plus output tokens stays within max_model_len.
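
As one concrete approach, here is a hedged sketch of client-side truncation before calling the OpenAI-compatible server. The model name, server URL, output budget, and the 512-token safety margin are assumptions for illustration, not values from this thread:

# Hedged sketch: count tokens client-side and truncate the prompt so that
# input + max_tokens fits within max_model_len. MODEL_NAME, SERVER_URL,
# MAX_COMPLETION, and the 512-token margin are illustrative assumptions.
from openai import OpenAI
from transformers import AutoTokenizer

MODEL_NAME = "your-org/your-model"        # assumption: the model you serve
SERVER_URL = "http://localhost:8000/v1"   # assumption: your vLLM endpoint
MAX_MODEL_LEN = 196608                    # from the error message above
MAX_COMPLETION = 20000                    # generation budget you actually need

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
client = OpenAI(base_url=SERVER_URL, api_key="EMPTY")

def truncate_to_budget(text: str, budget: int) -> str:
    # Keep only the last `budget` tokens; dropping the oldest text first
    # is one simple policy -- summarizing or chunking are alternatives.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[-budget:])

prompt = open("long_input.txt").read()
# Reserve room for the completion plus a margin for the chat template.
budget = MAX_MODEL_LEN - MAX_COMPLETION - 512
prompt = truncate_to_budget(prompt, budget)

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=MAX_COMPLETION,
)
print(resp.choices[0].message.content)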

Would you like a detailed explanation of the underlying mechanism, or a code example?


Can't vLLM's official OpenAI server handle this itself?