Ray cluster DeepSeek-R1-Distill-Qwen-32B-AWQ

Vladimir · July 23, 2025, 9:17am

For comparison, I tried to run another model with similar parameters:
vllm serve /mnt/nfs/models/Llama-2-7B-Chat-GPTQ/ --tensor-parallel-size 4 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 8 --max-num-batched-tokens 4096 --block-size 16 --dtype half
RunLLM, help me.

Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 614, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     return run_method(self, method, args, kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     return func(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model [repeated 6x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.model_runner.load_model() [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.model = get_model(vllm_config=self.vllm_config) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 59, in get_model [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     return loader.load_model(vllm_config=vllm_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     model = initialize_model(vllm_config=vllm_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 4x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     return model_class(vllm_config=vllm_config, prefix=prefix) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1223, in __init__ [repeated 12x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.model = self._init_model(vllm_config=vllm_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     return LlamaModel(vllm_config=vllm_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.start_layer, self.end_layer, self.layers = make_layers( [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                                                     ^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 640, in make_layers [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda> [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     lambda prefix: layer_type(config=config, [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                    ^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.mlp = LlamaMLP( [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                ^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.down_proj = RowParallelLinear( [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]                      ^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     self.quant_method.create_weights( [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]   File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/gptq.py", line 148, in create_weights [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622]     raise ValueError( [repeated 2x across cluster]
(RayWorkerWrapper pid=79744, ip=192.168.13.143) ERROR 07-23 09:14:10 [worker_base.py:622] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size. [repeated 2x across cluster]
[rank0]:[W723 09:14:12.981959726 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/deepseek-fb/vllm_env/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deepseek-fb/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Vladimir · July 25, 2025, 2:27pm

Answer RunLLM
Your error, "ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size," means the tensor parallel size you set (4) is too large for the GPTQ quantized model you are using. With GPTQ, each rank's share of a layer's input size must be a multiple of the quantization group size; when it is not, this error is raised (source).

To fix this, reduce --tensor-parallel-size to a value for which the per-rank input size stays aligned (in your traceback the failing layer is down_proj, so the relevant size is intermediate_size divided by the tensor parallel size), or use a non-quantized model. Would you like help determining the correct tensor parallel size for your model?

RunLLM · July 25, 2025, 2:27pm

Yes, your error "ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size." means that the --tensor-parallel-size=4 you chose is not compatible with the GPTQ-quantized model Llama-2-7B-Chat-GPTQ. GPTQ requires each rank's share of the input size (usually hidden_size or intermediate_size divided by tensor_parallel_size) to be a multiple of the quantization group size. If it is not, this error is raised (details, gptq.py).

Solution: reduce --tensor-parallel-size to a value that keeps the per-rank input size aligned (for example, 1 or 2 for Llama-2-7B-GPTQ), or use a non-quantized model. Do you want me to help you pick a suitable tensor-parallel-size for your model?
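
A minimal sketch of the alignment check behind this error. It assumes the failing layer is down_proj (as the traceback suggests), so the input size is the model's intermediate_size (11008 for Llama-2-7B), and it assumes a GPTQ group size of 128; check quantize_config.json in your checkpoint for the real value. The helper name aligned_tp_sizes is hypothetical and is not part of vLLM.

```python
def aligned_tp_sizes(input_size: int, group_size: int,
                     candidates=(1, 2, 4, 8)) -> list[int]:
    """Return the tensor-parallel sizes whose per-rank input slice is a
    whole number of quantization groups."""
    ok = []
    for tp in candidates:
        if input_size % tp != 0:
            continue  # the dimension cannot even be split evenly
        per_rank = input_size // tp
        if per_rank % group_size == 0:
            ok.append(tp)
    return ok


if __name__ == "__main__":
    # Assumed values: intermediate_size of Llama-2-7B and a group size of 128.
    print(aligned_tp_sizes(11008, 128))
```

With these assumed values only 1 and 2 pass, because 11008 / 4 = 2752 is not a multiple of 128.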

Vladimir · August 18, 2025, 7:14am

Is it possible to use RAG to mix extra context into the prompt of my queries? I want the model to know information from my personal wiki.

RunLLM · August 18, 2025, 7:15am

Yes, you can use RAG (Retrieval-Augmented Generation) to add information from your personal wiki to the queries sent to the model. To do this, combine vLLM with LangChain or LlamaIndex: first retrieve the relevant documents from your knowledge base, then build a prompt with that data and pass it to the model through vLLM (more details, LangChain example).

vLLM itself does not implement retrieval, but it integrates easily with RAG tools that automatically mix context into the prompt. Do you want detailed instructions or a code example?
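
The retrieve-then-prompt flow can be sketched in plain Python. This is only an illustration under stated assumptions: the word-overlap retriever is a toy stand-in for a real vector store, and the function names and the prompt wording are hypothetical, not taken from LangChain or vLLM.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Mix the retrieved context into the prompt that is sent to the model."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question based on the context.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    wiki = [
        "The backup server runs every night at 02:00.",
        "VPN access requires a hardware token.",
        "The office printer is on the second floor.",
    ]
    question = "When does the backup server run?"
    print(build_prompt(question, retrieve(question, wiki, k=1)))
```

In a real setup the retriever would be backed by embeddings and a vector database, and the resulting prompt would be sent to the vLLM server.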

Vladimir · August 18, 2025, 7:20am

Yes, I do.
Write out step by step where I should start, what needs to be installed, and how to set it up to work with Qwen2.5-VL-7B.

RunLLM · August 18, 2025, 7:21am

To run RAG over your wiki with Qwen2.5-VL-7B through vLLM, follow these steps:

Install the dependencies

pip install -U vllm \
    langchain_milvus langchain_openai \
    langchain_community beautifulsoup4 \
    langchain-text-splitters

(source)

Start the vLLM server with the Qwen2.5-VL-7B model
```
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 8000 --dtype=half
```
(source)

Prepare your wiki
Convert your articles into text files, or use a LangChain document loader (for example WebBaseLoader, PyPDFLoader, and so on).

Create the RAG script
Use the retrieval_augmented_generation_with_langchain.py example and replace the document loading with your wiki.
In the script, set:
- embedding_model = "Qwen/Qwen2.5-VL-7B-Instruct"
- chat_model = "Qwen/Qwen2.5-VL-7B-Instruct"
- vllm_api_base = "http://localhost:8000/v1"

Run the script

python retrieval_augmented_generation_with_langchain.py --url "path_to_your_wiki"

Do you want a detailed code example for loading your wiki?
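
As a rough illustration of the "prepare your wiki" step, here is a sketch that reads text files and cuts them into overlapping chunks. It is a simplified stand-in for a LangChain loader and text splitter, not their actual algorithm; the function names and the chunk_size and chunk_overlap defaults are assumptions chosen for the example.

```python
from pathlib import Path


def load_wiki_texts(root: str) -> dict[str, str]:
    """Read every .txt file under root into a {path: text} mapping."""
    return {
        str(path): path.read_text(encoding="utf-8")
        for path in sorted(Path(root).rglob("*.txt"))
    }


def split_with_overlap(text: str, chunk_size: int = 1000,
                       chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one of the two neighbouring chunks.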

Vladimir · August 18, 2025, 11:29am

Do both of my models, Llama-2-7B-Chat-GPTQ and Qwen2.5-VL-7B-Instruct, support embedding?

RunLLM · August 18, 2025, 11:29am

Yes, the Qwen/Qwen2.5-VL-7B-Instruct model officially supports the embedding task (text vectorization) when it is started with the --task embed parameter (documentation). For Llama-2-7B-Chat-GPTQ, embedding support depends on the GPTQ implementation: if the model was converted correctly and is not restricted to generation only, it can also be used with --task embed, but embedding is officially guaranteed only for regular Llama-2-7B models (model table).

Do you want an example command for running embedding on one of these models?

Vladimir · August 19, 2025, 10:27am

Which endpoint does the Qwen2.5-VL-7B-Instruct model use?
Is it /v1/chat/completions or some other one?

RunLLM · August 19, 2025, 10:27am

Qwen2.5-VL-7B-Instruct uses the standard /v1/chat/completions endpoint, like other chat models in vLLM (documentation). You can send requests in the OpenAI Chat API format.

Do you want an example request or a curl command?
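
A minimal sketch of such a request built with the Python standard library. It assumes the server from the earlier step is listening on http://localhost:8000/v1 and serves the model under the name Qwen/Qwen2.5-VL-7B-Instruct; the helper name build_chat_request is hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8000/v1",      # assumed server address
        "Qwen/Qwen2.5-VL-7B-Instruct",   # assumed served model name
        "Hello!",
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Sending it is then a matter of calling urllib.request.urlopen(req) and reading choices[0].message.content from the JSON reply.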

Sources:

OpenAI Chat Completions API with vLLM

Vladimir · August 25, 2025, 7:05am

I have this script to run RAG
retrieval_augmented_generation_with_langchain.py
import argparse
from argparse import Namespace
from typing import Any, List
import json
import requests

from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_milvus import Milvus
from langchain_openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
import torch

from langchain_huggingface import HuggingFaceEmbeddings

def load_and_split_documents(config: dict[str, Any]) -> List[Document]:
    try:
        loader = TextLoader(config["file_path"])
        docs = loader.load()
        print(f"DEBUG: Loaded raw documents: {docs}")
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config["chunk_size"],
            chunk_overlap=config["chunk_overlap"],
        )
        chunks = text_splitter.split_documents(docs)
        print(f"DEBUG: Split into {len(chunks)} chunks")
        return chunks
    except Exception as e:
        print(f"Document loading error: {str(e)}")
        raise

def init_vectorstore(config: dict[str, Any], documents: list[Document]):
    embeddings = HuggingFaceEmbeddings(
        model_name=config["embedding_model"],
        model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
        encode_kwargs={"batch_size": 32, "normalize_embeddings": True}
    )

    vectorstore = Milvus.from_documents(
        documents=documents,
        embedding=embeddings,
        connection_args={
            "host": config["milvus_host"],
            "port": config["milvus_port"]
        },
        drop_old=True,
    )
    print(f"DEBUG: Vectorstore initialized with {len(documents)} documents")
    return vectorstore

def init_llm(config: dict[str, Any]):
    """Initialize the LLM through OpenAI to connect to the vLLM server"""
    try:
        return OpenAI(
            openai_api_base=config.get("vllm_api_base", "http://192.168.13.140:8000/v1"),
            openai_api_key="dummy",
            model_name=config["chat_model"],
            max_tokens=2000,
            temperature=0.7,
            top_p=0.95
        )
    except Exception as e:
        print(f"LLM initialization error: {str(e)}")
        raise

def get_qa_prompt():
    """Prompt for /v1/completions"""
    template = """Answer the question based on the context. If the answer is not in the context, say "I don't know".
Context: {context}
Question: {question}
Answer: """
    return PromptTemplate.from_template(template)

def format_docs(docs: list[Document]):
    if docs is None:
        print("ERROR: No documents provided to format_docs")
        return ""
    return "\n\n".join(doc.page_content for doc in docs)

def create_qa_chain(retriever: Any, llm: Any, prompt: PromptTemplate):
    def debug_retriever(question):
        docs = retriever.invoke(question)
        print(f"DEBUG: Retrieved docs for question '{question}': {docs}")
        if docs is None or len(docs) == 0:
            print("WARNING: No documents retrieved!")
        return docs

    def debug_llm(input_dict):
        formatted_prompt = prompt.invoke(input_dict)
        print(f"DEBUG: Formatted prompt sent to LLM: {formatted_prompt}")
        try:
            # Direct request to vLLM for debugging
            payload = {
                "model": llm.model_name,
                "prompt": formatted_prompt.text,
                "max_tokens": llm.max_tokens,
                "temperature": llm.temperature,
                "top_p": llm.top_p
            }
            print(f"DEBUG: Request payload to vLLM: {json.dumps(payload, indent=2, ensure_ascii=False)}")
            response = requests.post(
                f"{llm.openai_api_base}/completions",
                json=payload,
                headers={"Content-Type": "application/json"}
            )
            print(f"DEBUG: Raw vLLM response: {response.text}")
            response.raise_for_status()  # Raises an exception on HTTP errors
            llm_response = llm.invoke(formatted_prompt)
            print(f"DEBUG: LLM response: {llm_response}")
            if llm_response is None:
                print("ERROR: LLM returned None")
                return "LLM returned no response"
            return llm_response
        except Exception as e:
            print(f"ERROR: LLM invocation failed: {str(e)}")
            return f"LLM error: {str(e)}"

    return (
        {
            "context": RunnableLambda(debug_retriever) | format_docs,
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(debug_llm)
        | StrOutputParser()
    )

def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="RAG with vLLM and LangChain")

    parser.add_argument(
        "--file-path",
        default="result.txt",
        help="Path to the wiki file"
    )
    parser.add_argument(
        "--milvus-host",
        default="localhost",
        help="Milvus host"
    )
    parser.add_argument(
        "--milvus-port",
        default="19530",
        help="Milvus port"
    )
    parser.add_argument(
        "--embedding-model",
        default="sentence-transformers/all-MiniLM-L6-v2",
        help="Embedding model"
    )
    parser.add_argument(
        "--chat-model",
        default="/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/",
        help="Chat model"
    )
    parser.add_argument(
        "--vllm-api-base",
        default="http://192.168.13.140:8000/v1",
        help="Base URL of the vLLM API"
    )
    parser.add_argument(
        "-i", "--interactive",
        action="store_true",
        help="Interactive mode"
    )
    parser.add_argument(
        "-k", "--top-k",
        type=int,
        default=3,
        help="Number of results to return"
    )
    parser.add_argument(
        "-c",
        "--chunk-size",
        type=int,
        default=1000,
        help="Text chunk size",
    )
    parser.add_argument(
        "-o",
        "--chunk-overlap",
        type=int,
        default=200,
        help="Text chunk overlap",
    )

    return parser

def init_config(args: Namespace):
    return {
        "file_path": args.file_path,
        "milvus_host": args.milvus_host,
        "milvus_port": args.milvus_port,
        "embedding_model": args.embedding_model,
        "chat_model": args.chat_model,
        "vllm_api_base": args.vllm_api_base,
        "chunk_size": args.chunk_size,
        "chunk_overlap": args.chunk_overlap,
        "top_k": args.top_k,
    }

def main():
    args = get_parser().parse_args()
    config = init_config(args)

    print(f"Connecting to Milvus: {config['milvus_host']}:{config['milvus_port']}")

    try:
        # Load and process the documents
        documents = load_and_split_documents(config)
        print(f"✅ Loaded {len(documents)} chunks from your wiki")

        # Initialize the vector store
        vectorstore = init_vectorstore(config, documents)
        retriever = vectorstore.as_retriever(search_kwargs={"k": config["top_k"]})
        print("✅ Vector store initialized")

        # Initialize the LLM
        llm = init_llm(config)
        prompt = get_qa_prompt()
        print("✅ Model and prompt initialized")

        # Build the question-answering chain
        qa_chain = create_qa_chain(retriever, llm, prompt)
        print("✅ Question-answering chain is ready")

        # Test query
        test_question = "What is this wiki about?"
        try:
            output = qa_chain.invoke(test_question)
            print("\nTest response:")
            print("-" * 50)
            print(output)
            print("-" * 50)
        except Exception as e:
            print(f"❌ Test request error: {str(e)}")
            return

        # Interactive mode
        if args.interactive:
            print("\nWelcome to the interactive question-answering system!")
            print("Enter 'q' or 'quit' to exit.")

            while True:
                question = input("\nYour question: ")
                if question.lower() in ["q", "quit"]:
                    print("\nThanks for using it! Goodbye!")
                    break

                try:
                    output = qa_chain.invoke(question)
                    print(f"\nAnswer: {output}")
                except Exception as e:
                    print(f"Generation error: {str(e)}")

    except Exception as e:
        print(f"❌ Critical error: {str(e)}")
        print("Check the settings and the running services")

if __name__ == "__main__":
    main()

It produces the following result:
DEBUG: Raw vLLM response: {"object":"error","message":"The model does not support Completions API","type":"BadRequestError","param":null,"code":400}
ERROR: LLM invocation failed: 'NoneType' object is not iterable

Test response:

LLM error: 'NoneType' object is not iterable
How do I fix this?

RunLLM · August 25, 2025, 7:06am

Your error:
{"object":"error","message":"The model does not support Completions API",…}
means that Qwen2.5-VL-7B-Instruct does not support the /v1/completions endpoint, only /v1/chat/completions (documentation).

How to fix it:

In the debug_llm function, replace the endpoint:

response = requests.post(
    f"{llm.openai_api_base}/chat/completions",
    json={
        "model": llm.model_name,
        "messages": [
            {"role": "user", "content": formatted_prompt.text}
        ],
        "max_tokens": llm.max_tokens,
        "temperature": llm.temperature,
        "top_p": llm.top_p
    },
    headers={"Content-Type": "application/json"}
)

Also make sure you use the messages field rather than prompt, as the OpenAI Chat API requires.

Reason:
Qwen2.5-VL-7B-Instruct is a chat model and works only with /v1/chat/completions, not with /v1/completions (more details).

Do you need a full example of the corrected code?

Sources:

OpenAI Chat Completions API with vLLM

Vladimir · August 25, 2025, 7:10am

Yes. Fix it and print it out. Thanks.

RunLLM · August 25, 2025, 7:11am

Here is the corrected debug_llm function for Qwen2.5-VL-7B-Instruct, which uses the /chat/completions endpoint and the correct messages format:

def debug_llm(input_dict):
    formatted_prompt = prompt.invoke(input_dict)
    print(f"DEBUG: Formatted prompt sent to LLM: {formatted_prompt}")
    try:
        payload = {
            "model": llm.model_name,
            "messages": [
                {"role": "user", "content": formatted_prompt.text}
            ],
            "max_tokens": llm.max_tokens,
            "temperature": llm.temperature,
            "top_p": llm.top_p
        }
        print(f"DEBUG: Request payload to vLLM: {json.dumps(payload, indent=2, ensure_ascii=False)}")
        response = requests.post(
            f"{llm.openai_api_base}/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        print(f"DEBUG: Raw vLLM response: {response.text}")
        response.raise_for_status()
        result = response.json()
        if "choices" in result and result["choices"]:
            return result["choices"][0]["message"]["content"]
        else:
            print("ERROR: No choices in response")
            return "LLM returned no response"
    except Exception as e:
        print(f"ERROR: LLM invocation failed: {str(e)}")
        return f"LLM error: {str(e)}"

Requests will now work correctly with Qwen2.5-VL-7B-Instruct through /chat/completions.
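
A small sketch of how the reply body could be unpacked so that a server-side error is reported clearly instead of turning into a confusing exception later. The error shape is taken from the debug output earlier in this thread; the helper name extract_chat_content is hypothetical.

```python
def extract_chat_content(body: dict) -> str:
    """Return the assistant text from a chat-completions response body.

    Raises RuntimeError with the server's message if the body is an error
    object such as {"object": "error", "message": ..., "code": 400}.
    """
    if body.get("object") == "error":
        raise RuntimeError(f"vLLM error {body.get('code')}: {body.get('message')}")
    choices = body.get("choices") or []
    if not choices:
        raise RuntimeError("response contains no choices")
    return choices[0]["message"]["content"]


if __name__ == "__main__":
    sample = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
    print(extract_chat_content(sample))
```

Inside debug_llm this would be called as extract_chat_content(response.json()) before the text is returned to the chain.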

Sources:

OpenAI Chat Completions API with vLLM

Vladimir · August 25, 2025, 7:16am

RunLLM:

def debug_llm(input_dict):
    formatted_prompt = prompt.invoke(input_dict)
    print(f"DEBUG: Formatted prompt sent to LLM: {formatted_prompt}")
    try:
        payload = {
            "model": llm.model_name,
            "messages": [
                {"role": "user", "content": formatted_prompt.text}
            ],
            "max_tokens": llm.max_tokens,
            "temperature": llm.temperature,
            "top_p": llm.top_p
        }
        print(f"DEBUG: Request payload to vLLM: {json.dumps(payload, indent=2, ensure_ascii=False)}")
        response = requests.post(
            f"{llm.openai_api_base}/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        print(f"DEBUG: Raw vLLM response: {response.text}")
        response.raise_for_status()
        result = response.json()
        if "choices" in result and result["choices"]:
            return result["choices"][0]["message"]["content"]
        else:
            print("ERROR: No choices in response")
            return "LLM returned no response"
    except Exception as e:
        print(f"ERROR: LLM invocation failed: {str(e)}")
        return f"LLM error: {str(e)}"

Now it says:
Test request error: 'NoneType' object has no attribute 'invoke'
Error in sys.excepthook:

Original exception was:

RunLLM · August 25, 2025, 7:17am

The error "'NoneType' object has no attribute 'invoke'" means that the prompt variable is not initialized or is None at the moment prompt.invoke(input_dict) is called. This is not related to vLLM or the API but to the logic of your code: make sure that prompt = get_qa_prompt() is called and that prompt reaches debug_llm either as a global variable or as a function argument.

Solution:
Pass prompt to debug_llm explicitly, for example:

def debug_llm(input_dict, prompt):
    formatted_prompt = prompt.invoke(input_dict)
    ...

Or make sure prompt is defined globally and initialized before debug_llm is called.

Do you want a full example of the corrected call?

Sources:

retrieval_augmented_generation_with_langchain.py

Vladimir · August 25, 2025, 7:17am

Yes. I want the fully corrected version.

RunLLM · August 25, 2025, 7:18am

Here is a fully corrected example of the debug_llm function and its integration with the chain, so that the prompt error is avoided and Qwen2.5-VL-7B-Instruct is called correctly through /chat/completions:

def debug_llm(input_dict, prompt, llm):
    formatted_prompt = prompt.invoke(input_dict)
    print(f"DEBUG: Formatted prompt sent to LLM: {formatted_prompt}")
    try:
        payload = {
            "model": llm.model_name,
            "messages": [
                {"role": "user", "content": formatted_prompt.text}
            ],
            "max_tokens": llm.max_tokens,
            "temperature": llm.temperature,
            "top_p": llm.top_p
        }
        print(f"DEBUG: Request payload to vLLM: {json.dumps(payload, indent=2, ensure_ascii=False)}")
        response = requests.post(
            f"{llm.openai_api_base}/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        print(f"DEBUG: Raw vLLM response: {response.text}")
        response.raise_for_status()
        result = response.json()
        if "choices" in result and result["choices"]:
            return result["choices"][0]["message"]["content"]
        else:
            print("ERROR: No choices in response")
            return "LLM returned no response"
    except Exception as e:
        print(f"ERROR: LLM invocation failed: {str(e)}")
        return f"LLM error: {str(e)}"

# Example of integration with a LangChain chain:
def create_qa_chain(retriever, llm, prompt):
    def debug_llm_wrapper(input_dict):
        return debug_llm(input_dict, prompt, llm)
    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(debug_llm_wrapper)
        | StrOutputParser()
    )

Now prompt and llm are passed explicitly, so the 'NoneType' object has no attribute 'invoke' error should no longer occur.
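
The explicit-binding idea can also be sketched with functools.partial, together with an early check that turns a missing prompt into a readable message. FakePrompt and llm_call are hypothetical stand-ins for the LangChain prompt and the HTTP call; this illustrates the pattern and does not replace the code above.

```python
from functools import partial


class FakePrompt:
    """Hypothetical stand-in for a LangChain PromptTemplate."""

    def invoke(self, values: dict) -> str:
        return f"Context: {values['context']}\nQuestion: {values['question']}"


def debug_llm(input_dict: dict, prompt, llm_call) -> str:
    # Fail early with a clear message instead of "'NoneType' has no attribute 'invoke'".
    if prompt is None:
        raise ValueError("prompt is None; pass the object returned by get_qa_prompt()")
    return llm_call(prompt.invoke(input_dict))


def make_step(prompt, llm_call):
    """Bind prompt and llm_call once so the chain step only takes input_dict."""
    return partial(debug_llm, prompt=prompt, llm_call=llm_call)


if __name__ == "__main__":
    step = make_step(FakePrompt(), str.upper)  # str.upper plays the role of the model
    print(step({"context": "a", "question": "b"}))
```

If the check fires, the function that builds the prompt is probably returning None, for example because it lacks a return statement.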

Sources:

OpenAI Chat Completions API with vLLM

Vladimir · August 25, 2025, 7:22am

RunLLM:

def debug_llm(input_dict, prompt, llm):
    formatted_prompt = prompt.invoke(input_dict)
    print(f"DEBUG: Formatted prompt sent to LLM: {formatted_prompt}")
    try:
        payload = {
            "model": llm.model_name,
            "messages": [
                {"role": "user", "content": formatted_prompt.text}
            ],
            "max_tokens": llm.max_tokens,
            "temperature": llm.temperature,
            "top_p": llm.top_p
        }
        print(f"DEBUG: Request payload to vLLM: {json.dumps(payload, indent=2, ensure_ascii=False)}")
        response = requests.post(
            f"{llm.openai_api_base}/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        print(f"DEBUG: Raw vLLM response: {response.text}")
        response.raise_for_status()
        result = response.json()
        if "choices" in result and result["choices"]:
            return result["choices"][0]["message"]["content"]
        else:
            print("ERROR: No choices in response")
            return "LLM returned no response"
    except Exception as e:
        print(f"ERROR: LLM invocation failed: {str(e)}")
        return f"LLM error: {str(e)}"

I run it with this command:
(vllm_env) deepseek-fb@deepseek-fb1:~$ OPENAI_API_KEY="dummy" python retrieval_augmented_generation_with_langchain.py --file-path "result.txt" --milvus-host localhost --milvus-port 19530 --vllm-api-base "http://192.168.13.140:8000/v1/completions" --chat-model "/mnt/nfs/models/Qwen2.5-VL-7B-Instruct/"
Test request error: 'NoneType' object has no attribute 'invoke'
Error in sys.excepthook:

Original exception was:
