When using large batches, the Ray service crashes: ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read

Hi, I encountered a very similar issue with RayChannelTimeoutError and vLLM crashing when running distributed inference with a large model and large batch sizes. Here are the details:

Environment:

  • vLLM version: 0.9.0 (V1 API)
  • Model: InternVL3-78B
  • GPU: 8 x 95GB (NVIDIA)
  • CUDA: 12.x
  • Python: 3.11
  • OS: Linux 5.4.119-19.0009.56 (Docker container)
  • Ray backend: enabled (--distributed-executor-backend ray)
  • vLLM launch command:
vllm serve /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/ \
  --port 6699 \
  --host 0.0.0.0 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --disable-custom-all-reduce \
  --dtype half \
  --max-model-len 32768 \
  --trust-remote-code \
  --max-num-batched-tokens 61400 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.95 \
  --swap-space 0 \
  --cpu-offload-gb 0 \
  --block-size 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-seq-len-to-capture 8192 \
  --limit-mm-per-prompt '{"image": 1}' \
  --disable-log-requests \
  --disable-log-stats \
  --load-format auto

Client:
Batch size: 16
Using vLLM OpenAI API (/v1/chat/completions)

What happens:

  • The first batch of requests (16 samples) completes successfully.
  • Starting from the second batch, all requests fail with HTTP 500 errors.
  • Client logs show:
    Connection pool is full, discarding connection
    Failed to establish a new connection: [Errno 111] Connection refused
  • Server logs show:
    ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read.
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    Ray workers are killed, SIGTERM received, server process exits.

Sample server log:

ERROR [core.py:502] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: ...
ERROR [core.py:502] The above exception was the direct cause of the following exception:
ERROR [core.py:502] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
...
INFO:     ... "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
...

Additional details:

  • Each input sample consists of one image (about 256 tokens after encoding) and 500 text tokens; the average output is about 200 tokens.
  • On the client side, I use a thread pool to send requests concurrently. The thread pool size is set to min(len(batch_requests), 64), with the intention of keeping it to no more than 8 concurrent threads per batch so the service is not overloaded (a minimal sketch of this request pattern follows this list).
  • Currently, if I set both the batch size and max_workers to 8, the system runs stably. However, increasing either to 16 immediately triggers the RayChannelTimeoutError and server crash.
  • With similar parameters, I have previously run the pipeline at nearly 10x the current speed, which makes the current performance and stability seem unreasonable.
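
For reference, here is a minimal sketch of that client-side request pattern, assuming the OpenAI-compatible /v1/chat/completions endpoint and model path from this post; the helper names, payload fields, and the MAX_WORKERS cap are illustrative, not my exact script:

# Minimal sketch of the client-side batching described above. Helper names and
# payload fields are illustrative; the endpoint and model path are from this post.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://172.21.0.67:6699/v1/chat/completions"
MODEL = "/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/"
MAX_WORKERS = 8  # assumption: keep concurrency well below the server's max_num_seqs


def send_one(sample: dict) -> dict:
    """Send one chat-completion request containing a single image plus text."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": sample["image_url"]}},
                {"type": "text", "text": sample["prompt"]},
            ],
        }],
        "max_tokens": 800,
        "temperature": 0.0,
    }
    resp = requests.post(API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()


def send_batch(batch: list[dict]) -> list[dict]:
    """Fan one batch out over a bounded thread pool and collect the responses."""
    with ThreadPoolExecutor(max_workers=min(len(batch), MAX_WORKERS)) as pool:
        return list(pool.map(send_one, batch))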

What I have tried:

  • Reducing batch size to 8 or 4 (still crashes, but sometimes after more batches)
  • Lowering max-num-seqs and max-num-batched-tokens (no fundamental improvement)
  • Monitoring GPU/CPU/memory usage (no OOM, but Ray workers die after a few batches)
  • Disabling --enable-chunked-prefill and --enable-prefix-caching (no effect)
  • Single-node, single-GPU mode works fine (no crash)
  • All model weights are complete and compatible

Summary:

  • The issue only occurs with Ray distributed backend and large model/batch.
  • First batch always works, subsequent batches trigger RayChannelTimeoutError and server crash.
  • This seems to be the same root cause as described in this issue.
  • Any advice or workaround would be greatly appreciated!
  • Is there a recommended way to stabilize vLLM distributed inference for large models?
  • Are there any Ray/vLLM config tweaks or patches that help?
  • Is this a known bug with a fix in a newer version?

Thank you!

client log details:

root@5a80c4dad3b0:/mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage3_valid# python stage3_vadlid.py --input /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage1_cls/output_stage1/llavaOV_sensenova_inf_30w_35w_48199/llavaOV_sensenova_inf_30w_35w_48199_cls.json --api-url http://172.21.0.67:6699 --batch-size 16
[运行模式] 纯CPU模式:CPU模式
[运行模式] 纯CPU模式:CPU模式
2025-07-29 14:08:00,701 [INFO] [DIAG] 开始环境诊断(vLLM批推理模式)...
2025-07-29 14:08:00,703 [INFO] [DIAG] 系统: Linux 5.4.119-19.0009.56
2025-07-29 14:08:00,703 [INFO] [DIAG] 主机名: 5a80c4dad3b0
2025-07-29 14:08:00,703 [INFO] [DIAG] CPU核心数: 128
2025-07-29 14:08:00,703 [INFO] [DIAG] 容器环境: 是
2025-07-29 14:08:00,703 [INFO] [DIAG] vLLM API地址: http://172.21.0.67:6699
2025-07-29 14:08:00,707 [INFO] [DIAG] ✅ vLLM API服务连接正常
2025-07-29 14:08:00,707 [INFO] [DIAG] 💡 vLLM批推理版本优势:
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 智能批处理:根据vLLM服务配置自动调整批次大小
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 并发请求:利用线程池最大化vLLM服务的并发处理能力
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 高效处理:批量处理提升推理效率
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 资源友好:纯CPU运行,不占用本地GPU资源
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 宽松标准:部分正确也算正确,提高通过率
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 远程vLLM:使用http://172.21.0.67:6699的远程vLLM服务
2025-07-29 14:08:00,707 [INFO] [DIAG] 🔧 调试建议:如遇到问题,请尝试:
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 添加 --test-mode --batch-size 4 --debug 参数
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 检查远程vLLM服务是否正常运行
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 确认网络连接到http://172.21.0.67:6699
2025-07-29 14:08:00,707 [INFO] [DIAG] 📋 必需参数提醒:
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 必须指定 --api-url 或 --host + --port 参数
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 示例: --api-url http://172.21.0.36:6666
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 或者: --host 172.21.0.36 --port 6666
2025-07-29 14:08:00,707 [INFO] [DIAG] 🎯 vLLM批推理说明:
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 智能批处理:自动根据vLLM的max_num_seqs配置调整
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 并发优化:每批最多8个并发线程,避免服务过载
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 批次大小:建议4-16(会自动分割为更小的并发批次)
2025-07-29 14:08:00,707 [INFO] [DIAG]   - 服务配置:检测到您的vLLM配置max_num_seqs=32
2025-07-29 14:08:00,707 [INFO] ============================================================
2025-07-29 14:08:00,707 [INFO] Stage-3 Valid InternVL-78B 一步验证脚本(vLLM批推理版本)
2025-07-29 14:08:00,707 [INFO] ============================================================
2025-07-29 14:08:00,707 [INFO] 🖥️ 运行模式: CPU模式
2025-07-29 14:08:00,707 [INFO] 📁 输入文件: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage1_cls/output_stage1/llavaOV_sensenova_inf_30w_35w_48199/llavaOV_sensenova_inf_30w_35w_48199_cls.json
2025-07-29 14:08:00,707 [INFO] 📂 输出根目录: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage3_valid/output_stage3
2025-07-29 14:08:00,707 [INFO] 📋 实际输出目录: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage3_valid/output_stage3/llavaOV_sensenova_inf_30w_35w_48199_cls/
2025-07-29 14:08:00,707 [INFO]    ├── llavaOV_sensenova_inf_30w_35w_48199_cls_valid.json      # 验证通过的数据
2025-07-29 14:08:00,707 [INFO]    ├── llavaOV_sensenova_inf_30w_35w_48199_cls_invalid.json    # 验证失败的数据
2025-07-29 14:08:00,708 [INFO]    ├── llavaOV_sensenova_inf_30w_35w_48199_cls_report.txt      # 详细报告
2025-07-29 14:08:00,708 [INFO]    └── checkpoints/                     # 检查点文件
2025-07-29 14:08:00,708 [INFO] ⚙️ vLLM批次大小: 16
2025-07-29 14:08:00,708 [INFO] 🌐 vLLM API地址: http://172.21.0.67:6699
2025-07-29 14:08:00,708 [INFO] 🔁 请求重试次数: 3
2025-07-29 14:08:00,708 [INFO] 💾 检查点间隔: 每 100 条数据
2025-07-29 14:08:00,708 [INFO] ✨ 验证模式: 一步宽松验证(部分正确也算正确)
2025-07-29 14:08:00,708 [INFO] 🚀 推理方式: vLLM原生批推理格式
2025-07-29 14:08:00,708 [INFO] ⚡ 资源优化: 纯CPU运行,使用远程vLLM服务
2025-07-29 14:08:00,708 [INFO] 🎯 模型路径: /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B
2025-07-29 14:08:00,708 [INFO] ============================================================
2025-07-29 14:08:00,708 [INFO] 正在测试vLLM API连接: http://172.21.0.67:6699
2025-07-29 14:08:00,711 [INFO] ✅ vLLM API健康检查通过
2025-07-29 14:08:00,714 [INFO] 可用模型: {'object': 'list', 'data': [{'id': '/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/', 'object': 'model', 'created': 1753769280, 'owned_by': 'vllm', 'root': '/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/', 'parent': None, 'max_model_len': 32768, 'permission': [{'id': 'modelperm-4af0ba68d6ae4301a7c40d5c0345c0e4', 'object': 'model_permission', 'created': 1753769280, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}
2025-07-29 14:08:00,714 [INFO] ✅ vLLM API连接配置完成: http://172.21.0.67:6699
2025-07-29 14:08:00,714 [INFO] 🔄 检查点恢复: 启用
2025-07-29 14:08:09,656 [INFO] 🔍 检查点状态:
2025-07-29 14:08:09,657 [INFO]    validation_progress: 32200 个样本已处理
2025-07-29 14:08:09,657 [INFO]    validation_batch: 32160 个样本已处理
2025-07-29 14:08:09,657 [INFO] 📊 将从进度检查点恢复: 32200 个样本
2025-07-29 14:08:09,657 [INFO] 开始运行 Stage-3 Valid 一步验证流程(vLLM批推理版本)...
2025-07-29 14:08:09,657 [INFO] 所有输出将保存至: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage3_valid/output_stage3/llavaOV_sensenova_inf_30w_35w_48199_cls
2025-07-29 14:08:09,657 [INFO] 加载数据集: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage1_cls/output_stage1/llavaOV_sensenova_inf_30w_35w_48199/llavaOV_sensenova_inf_30w_35w_48199_cls.json
2025-07-29 14:08:11,695 [INFO] 成功加载并标准化 48199 / 48199 条数据
2025-07-29 14:08:19,044 [INFO] 检查点将保存至: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage3_valid/output_stage3/llavaOV_sensenova_inf_30w_35w_48199_cls/checkpoints
2025-07-29 14:08:22,866 [INFO] [CHECKPOINT] 发现 3 个检查点,尝试加载最新的...
2025-07-29 14:08:22,866 [INFO] [CHECKPOINT] 尝试加载检查点: 32200 个已处理样本
2025-07-29 14:08:24,097 [INFO] [CHECKPOINT] ✅ 成功加载检查点: 正确=26005, 错误=6195, 总计=32200
2025-07-29 14:08:24,097 [INFO] [RESUME] 🔄 从检查点恢复验证流程
2025-07-29 14:08:24,097 [INFO] [RESUME] 📊 已处理: 32200 / 48199 个样本
2025-07-29 14:08:24,097 [INFO] [RESUME] 📈 检查点数据: 正确=26005, 错误=6195
2025-07-29 14:08:24,097 [INFO] [RESUME] 📋 还需处理: 15999 个样本
2025-07-29 14:08:24,097 [INFO] >>> 继续一步验证(从检查点恢复)...
2025-07-29 14:08:24,097 [INFO] 继续处理剩余 15999 个样本...
2025-07-29 14:08:24,098 [INFO] [VALIDATION] 恢复验证: 15999个样本,分1000个批次处理
2025-07-29 14:08:24,098 [INFO] [VALIDATION] 🔄 恢复模式: 从第 32201 个样本开始
2025-07-29 14:08:24,098 [INFO] [VALIDATION] 📊 已有数据: 正确=26005, 错误=6195
2025-07-29 14:08:24,098 [INFO] [VALIDATION] 批处理大小: 16, 预计总时间: 约8000秒
恢复验证:   0%|                                                                                                        | 0/1000 [00:00<?, ?batch/s]2025-07-29 14:08:24,106 [INFO] [VALIDATION] ==================== 批次 1/1000 ====================
2025-07-29 14:08:24,106 [INFO] [VALIDATION] 全局样本范围: 32201-32216
2025-07-29 14:08:24,106 [INFO] [VALIDATION] 已处理总数: 32200个样本
2025-07-29 14:08:24,106 [INFO] [vLLM-BATCH] 开始处理批次,大小=16,首个样本ID=id_333620
2025-07-29 14:08:24,106 [INFO] [vLLM-BATCH] 开始准备vLLM批推理数据...
2025-07-29 14:08:24,788 [INFO] [vLLM-BATCH] 数据准备完成: 有效=16, 无效=0
2025-07-29 14:08:24,788 [INFO] [vLLM-INFERENCE] 开始vLLM批推理,有效样本数=16
2025-07-29 14:08:24,788 [INFO] [vLLM-INFERENCE] 准备发送vLLM批推理请求,数量=16
2025-07-29 14:08:24,788 [INFO] [vLLM-API] 使用vLLM批推理处理 16 个请求
2025-07-29 14:08:24,788 [INFO] [vLLM-API] 处理批次 1,大小: 16
2025-07-29 14:08:33,974 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:34,180 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:34,200 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:34,218 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:34,309 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:36,435 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:36,437 [INFO] [vLLM-API] 批次完成,成功: 16/16
2025-07-29 14:08:36,437 [INFO] [vLLM-PARSE] 开始解析vLLM批推理结果,响应数=16
2025-07-29 14:08:36,437 [INFO] [vLLM-PARSE] 结果解析完成: 总结果=16
2025-07-29 14:08:36,437 [INFO] [vLLM-PARSE] 批次统计: 成功=13/16
2025-07-29 14:08:36,438 [INFO] [VALIDATION] 批次1完成,耗时: 12.33秒
2025-07-29 14:08:36,438 [INFO] [VALIDATION] 批次1结果: 正确=13, 错误=3
2025-07-29 14:08:36,438 [INFO] [VALIDATION] 累计统计: 处理=32216, 成功率=80.8%
恢复验证:   0%|                                                         | 1/1000 [00:12<3:25:19, 12.33s/batch, processed=32216, success_rate=80.8%]2025-07-29 14:08:36,438 [INFO] [VALIDATION] ==================== 批次 2/1000 ====================
2025-07-29 14:08:36,438 [INFO] [VALIDATION] 全局样本范围: 32217-32232
2025-07-29 14:08:36,438 [INFO] [VALIDATION] 已处理总数: 32216个样本
2025-07-29 14:08:36,438 [INFO] [vLLM-BATCH] 开始处理批次,大小=16,首个样本ID=id_333634
2025-07-29 14:08:36,438 [INFO] [vLLM-BATCH] 开始准备vLLM批推理数据...
2025-07-29 14:08:36,968 [INFO] [vLLM-BATCH] 数据准备完成: 有效=16, 无效=0
2025-07-29 14:08:36,968 [INFO] [vLLM-INFERENCE] 开始vLLM批推理,有效样本数=16
2025-07-29 14:08:36,969 [INFO] [vLLM-INFERENCE] 准备发送vLLM批推理请求,数量=16
2025-07-29 14:08:36,969 [INFO] [vLLM-API] 使用vLLM批推理处理 16 个请求
2025-07-29 14:08:36,969 [INFO] [vLLM-API] 处理批次 1,大小: 16
2025-07-29 14:08:47,841 [WARNING] [vLLM-API] 请求 2 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,841 [WARNING] [vLLM-API] 请求 4 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,842 [WARNING] [vLLM-API] 请求 8 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,842 [WARNING] [vLLM-API] 请求 12 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,842 [WARNING] [vLLM-API] 请求 10 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,842 [WARNING] [vLLM-API] 请求 11 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,843 [WARNING] [vLLM-API] 请求 6 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,843 [WARNING] [vLLM-API] 请求 14 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,843 [WARNING] [vLLM-API] 请求 7 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,843 [WARNING] [vLLM-API] 请求 1 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,845 [WARNING] Connection pool is full, discarding connection: 172.21.0.67. Connection pool size: 10
2025-07-29 14:08:47,846 [WARNING] [vLLM-API] 请求 13 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,846 [WARNING] [vLLM-API] 请求 5 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,846 [WARNING] [vLLM-API] 请求 3 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,846 [WARNING] [vLLM-API] 请求 9 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,847 [WARNING] [vLLM-API] 请求 16 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,847 [WARNING] [vLLM-API] 请求 15 失败: vLLM API请求失败,状态码: 500
2025-07-29 14:08:47,848 [INFO] [vLLM-API] 批次完成,成功: 0/16
2025-07-29 14:08:47,848 [INFO] [vLLM-PARSE] 开始解析vLLM批推理结果,响应数=16
2025-07-29 14:08:47,848 [INFO] [vLLM-PARSE] 结果解析完成: 总结果=16
2025-07-29 14:08:47,848 [INFO] [vLLM-PARSE] 批次统计: 成功=0/16
2025-07-29 14:08:47,849 [INFO] [VALIDATION] 批次2完成,耗时: 11.41秒
2025-07-29 14:08:47,849 [INFO] [VALIDATION] 批次2结果: 正确=0, 错误=16
2025-07-29 14:08:47,849 [INFO] [VALIDATION] 累计统计: 处理=32232, 成功率=80.7%
恢复验证:   0%|                                                         | 1/1000 [00:23<3:25:19, 12.33s/batch, processed=32232, success_rate=80.7%]2025-07-29 14:08:47,849 [INFO] [VALIDATION] 预计剩余时间: 189.8分钟
恢复验证:   0%|                                                         | 2/1000 [00:23<3:16:06, 11.79s/batch, processed=32232, success_rate=80.7%]2025-07-29 14:08:47,849 [INFO] [VALIDATION] ==================== 批次 3/1000 ====================
2025-07-29 14:08:47,849 [INFO] [VALIDATION] 全局样本范围: 32233-32248
2025-07-29 14:08:47,849 [INFO] [VALIDATION] 已处理总数: 32232个样本
2025-07-29 14:08:47,849 [INFO] [vLLM-BATCH] 开始处理批次,大小=16,首个样本ID=id_333659
2025-07-29 14:08:47,849 [INFO] [vLLM-BATCH] 开始准备vLLM批推理数据...
2025-07-29 14:08:48,025 [INFO] [vLLM-BATCH] 数据准备完成: 有效=16, 无效=0
2025-07-29 14:08:48,025 [INFO] [vLLM-INFERENCE] 开始vLLM批推理,有效样本数=16
2025-07-29 14:08:48,025 [INFO] [vLLM-INFERENCE] 准备发送vLLM批推理请求,数量=16
2025-07-29 14:08:48,025 [INFO] [vLLM-API] 使用vLLM批推理处理 16 个请求
2025-07-29 14:08:48,025 [INFO] [vLLM-API] 处理批次 1,大小: 16
2025-07-29 14:08:48,034 [ERROR] [vLLM-API] 请求 5 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c1f210>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,034 [ERROR] [vLLM-API] 请求 4 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364273d390>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,035 [ERROR] [vLLM-API] 请求 1 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c120d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,037 [ERROR] [vLLM-API] 请求 6 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c18850>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,039 [ERROR] [vLLM-API] 请求 9 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637bcf5d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,039 [ERROR] [vLLM-API] 请求 8 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637bced90>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,059 [ERROR] [vLLM-API] 请求 13 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c050d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,059 [ERROR] [vLLM-API] 请求 11 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637bfdbd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,059 [ERROR] [vLLM-API] 请求 12 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c06290>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,060 [ERROR] [vLLM-API] 请求 14 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637bcfad0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,062 [ERROR] [vLLM-API] 请求 16 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c1ac50>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,062 [ERROR] [vLLM-API] 请求 10 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c05dd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,076 [ERROR] [vLLM-API] 请求 3 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c24810>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,078 [ERROR] [vLLM-API] 请求 15 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c1a950>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,078 [ERROR] [vLLM-API] 请求 2 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c07450>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-07-29 14:08:48,079 [ERROR] [vLLM-API] 请求 7 异常: 请求异常: HTTPConnectionPool(host='172.21.0.67', port=6699): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3637c1d410>: Failed to establish a new connection: [Errno 111] Connection refused'))

This is a known issue with vLLM distributed inference using the Ray backend, especially with large models and high batch sizes. The RayChannelTimeoutError and subsequent server crash are typically caused by Ray’s default timeouts being too short for large, slow batches, or by resource exhaustion (e.g., memory, file descriptors, or network bottlenecks), even if GPU/CPU usage appears normal. Increasing batch size or concurrency exacerbates the problem, and the first batch often works because resources are not yet saturated or Ray’s internal state is not yet stressed. This is not unique to your setup and has been reported by others with similar symptoms and logs (vLLM Issue #15102, vLLM Issue #17881).

Recommended mitigations:

  • Increase Ray timeouts by setting the environment variables RAY_CGRAPH_submit_timeout and RAY_CGRAPH_get_timeout to higher values (e.g., 60–120 seconds or more) before launching the server (vLLM Issue #15102); see the sketch after this list.
  • Reduce max-num-seqs, max-num-batched-tokens, or batch size to lower per-batch memory and communication load.
  • Monitor Ray worker logs for OOM or resource errors, and check for network or shared memory bottlenecks.
  • Ensure /dev/shm is large enough and use --ipc=host in Docker (vLLM Issue #859).
  • If possible, use the mp (multiprocessing) backend for single-node, multi-GPU runs, as it is more stable than Ray for large models (Distributed Serving Docs).
  • If you must use Ray, try the latest vLLM and Ray versions, as some bugs have been fixed in recent releases.
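
For Python-driven launches (e.g., the offline LLM class), a hedged sketch of raising those timeouts is below; for `vllm serve`, export the same variables in the shell before starting the server, as shown later in this thread. The values and engine arguments here are illustrative assumptions:

# Hedged sketch: set the Ray compiled-graph timeouts before the engine starts,
# so its child processes inherit them. Values are illustrative.
import os

os.environ["RAY_CGRAPH_get_timeout"] = "300"     # the log above reports the current value as 10 seconds
os.environ["RAY_CGRAPH_submit_timeout"] = "300"

from vllm import LLM  # construct the engine only after setting the variables

llm = LLM(
    model="/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/",
    tensor_parallel_size=8,
    distributed_executor_backend="ray",
    dtype="half",
    trust_remote_code=True,
)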

Would you like a step-by-step guide for applying these mitigations or more detail on debugging Ray timeouts and resource issues?


serve logs:

[2025-07-29 14:08:01] INFO: 172.21.0.13:39718 - "GET /health HTTP/1.1" 200 OK
[2025-07-29 14:08:01] INFO: 172.21.0.13:39720 - "GET /health HTTP/1.1" 200 OK
[2025-07-29 14:08:01] INFO: 172.21.0.13:39720 - "GET /v1/models HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40062 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40042 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40058 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:34] INFO: 172.21.0.13:40060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:35] INFO: 172.21.0.13:40056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:35] INFO: 172.21.0.13:40034 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:35] INFO: 172.21.0.13:40048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:35] INFO: 172.21.0.13:40044 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:37] INFO: 172.21.0.13:40038 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:68] Dumping input data
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:70] V1 LLM engine (v0.9.0) with config: model='/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/', speculative_config=None, tokenizer='/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512},
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:78] Dumping scheduler output for model execution:
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-d0933e8106704ddfa5305c6780635551,prompt_token_ids_len=1256,mm_inputs=[{'image_token_id': tensor(151667), 'image_num_patches': tensor([3]), 'pixel_values_flat': tensor([[[[2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400]]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[[2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400]]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[[2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.2489, 2.2489, 2.2489, ..., 2.2489, 2.2489, 2.2489]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.4286, 2.4286, 2.4286, ..., 2.4286, 2.4286, 2.4286]],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [[2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] ...,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400]]]])}],mm_hashes=['0228c0fd2f6238130d88376362365598ed63432c7353804797ccbb3ae6c2eb6f'],mm_positions=[PlaceholderRange(offset=194, length=770, is_embed=tensor([False, True, True, True, True, True, True, True, True, True,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] True, True, True, True, True, True, True, True, True, True,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] True, True, True, True, True, True, True, True, True, True,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] True, True, True, True, True, True, True, True, True, True,
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] True, True, True, True, True, True, True, True, True, True,

[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400],
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [dump_input.py:79] [2.6400, 2.6400, 2.6400, ..., 2.6400, 2.6400, 2.6400]]]])}],mm_hashes=['c3faa9dc69e19f088461439ce8bc8293bd629ae959e0a7b2ab0b64424d503816'],mm_positions=[PlaceholderRange(offset=194, length=3330, is_embed=tensor([False, True, True, ..., True, True, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=800, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=[[1, 2, 3, 4, 5, 6, 2526, 1267, 1965, 2910, 1949, 4983, 2893, 4735, 2800, 3191, 4957, 2742, 4020, 3639, 2683, 5638, 519, 1651, 5505, 929, 1652, 846, 251, 3276, 1382, 2355, 3936, 764, 5497, 5385, 2195, 286, 2423, 5192, 1983, 4390, 3292, 5006, 4682, 2754, 4119, 170, 182, 1035, 516, 1751, 4191, 262, 2409, 3369, 1903, 3578, 1457, 4501, 903, 3200, 105, 3895, 1788, 1197, 5191, 1127, 3586, 4560, 4890, 5544, 2205, 5364, 5389, 455, 3568, 1551, 2768, 3232, 4145, 14, 326, 3542, 4712, 200, 4288, 1099, 3491, 4465, 471, 4698, 5480, 3549, 3740, 1119, 3727, 2794, 678, 2211, 1872, 2868, 3426, 4287, 4135, 1262, 3221, 3428, 3534, 465, 4680, 2525, 2127]],num_computed_tokens=192,lora_request=None)], scheduled_cached_reqs=[CachedRequestData(req_id='chatcmpl-c115b21c0d2b4e028719e6b72c67cd6a', resumed_from_preemption=false, new_token_ids=[104317], new_block_ids=[[]], num_computed_tokens=1564)], num_scheduled_tokens={chatcmpl-5b80d5167e4544d4bf37d46389450721: 302, chatcmpl-c7820e224d85412db66590c4f3b75072: 3623, chatcmpl-7c15d323bcfa40deaa5555155c40c80b: 814, chatcmpl-cf8015881b6d405cb5fec9e22abc5ef5: 1863, chatcmpl-8705c061e797484393d0b532e817cfe3: 1842, chatcmpl-d0933e8106704ddfa5305c6780635551: 1064, chatcmpl-6bfb5fac8e1f4c8d9771a69105c01da9: 1855, chatcmpl-fed46a1420c845d58aa69c0cf712baa0: 338, chatcmpl-1ee9d1f4f8f94435b34e640718876c7b: 3478, chatcmpl-f8741509410849dcb5fd9c38f0b8dcf1: 2026, chatcmpl-76f0771924fc490e9774394227df4eca: 3388, chatcmpl-e6f7a10f443949b38c571f51867808c5: 3399, chatcmpl-a677a8af3e9f4be7aa196d29d5314532: 3402, chatcmpl-c115b21c0d2b4e028719e6b72c67cd6a: 1}, total_num_scheduled_tokens=27395, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-5b80d5167e4544d4bf37d46389450721: [0], chatcmpl-d0933e8106704ddfa5305c6780635551: [0], chatcmpl-76f0771924fc490e9774394227df4eca: [0], chatcmpl-1ee9d1f4f8f94435b34e640718876c7b: [0], chatcmpl-6bfb5fac8e1f4c8d9771a69105c01da9: [0], chatcmpl-8705c061e797484393d0b532e817cfe3: [0], chatcmpl-a677a8af3e9f4be7aa196d29d5314532: [0], chatcmpl-cf8015881b6d405cb5fec9e22abc5ef5: [0], chatcmpl-f8741509410849dcb5fd9c38f0b8dcf1: [0], chatcmpl-c7820e224d85412db66590c4f3b75072: [0], chatcmpl-7c15d323bcfa40deaa5555155c40c80b: [0], chatcmpl-fed46a1420c845d58aa69c0cf712baa0: [0], chatcmpl-e6f7a10f443949b38c571f51867808c5: [0]}, num_common_prefix_blocks=[6], finished_req_ids=[], free_encoder_input_ids=[['chatcmpl-c115b21c0d2b4e028719e6b72c67cd6a', 0]], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] EngineCore encountered a fatal error.
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] Traceback (most recent call last):
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/dag/compiled_dag_node.py", line 2531, in _execute_until
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] result = self._dag_output_fetcher.read(timeout)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/common.py", line 309, in read
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] outputs = self._read_list(timeout)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/common.py", line 400, in _read_list
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] raise e
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/common.py", line 382, in _read_list
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] result = c.read(min(remaining_timeout, iteration_timeout))
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] return self._channel_dict[self._resolve_actor_id()].read(timeout)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/shared_memory_channel.py", line 612, in read
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] output = self._buffers[self._next_read_index].read(timeout)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ret = self._worker.get_objects(
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/_private/worker.py", line 904, in get_objects
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ] = self.core_worker.get_objects(
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "python/ray/_raylet.pyx", line 3196, in ray._raylet.CoreWorker.get_objects
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "python/ray/includes/common.pxi", line 106, in ray._raylet.check_status
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 00a0a2b2c6761c99ffbd3bfe290a79346e2465d2010000000be1f505
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] The above exception was the direct cause of the following exception:
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502]
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] Traceback (most recent call last):
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] engine_core.run_busy_loop()
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] self._process_engine_step()
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] outputs = self.step_fn()
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 226, in step
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] model_output = self.execute_model(scheduler_output)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 213, in execute_model
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] raise err
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] return self.model_executor.execute_model(scheduler_output)
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/ray_distributed_executor.py", line 57, in execute_model
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] return refs[0].get()
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] self._dag._execute_until(
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/ray/dag/compiled_dag_node.py", line 2541, in _execute_until
[2025-07-29 14:08:48] 2025-07-29 14:08:47,837 INFO compiled_dag_node.py:2173 -- Tearing down compiled DAG
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] raise RayChannelTimeoutError(
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [core.py:502] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging.
[2025-07-29 14:08:48] INFO 07-29 14:08:47 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] AsyncLLM output_handler failed.
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] Traceback (most recent call last):
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] outputs = await engine_core.get_output_async()
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] raise self._format_exception(outputs) from None
[2025-07-29 14:08:48] ERROR 07-29 14:08:47 [async_llm.py:408] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[2025-07-29 14:08:48] 2025-07-29 14:08:47,837 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 1d4455cb1faf1b71a30cf38801000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, bb9f1744bf2f1200cafb73f101000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 5a33ec2b91235cb33da4822f01000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 8c2f6bbe3b450dff3d6cf6dd01000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ffbd3bfe290a79346e2465d201000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, c277541c672accdd2a6a195201000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 878e8beadc90a046175f39df01000000)
[2025-07-29 14:08:48] 2025-07-29 14:08:47,838 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 9e69485269c2f30724737d0601000000)
[2025-07-29 14:08:48] INFO: 172.21.0.13:40058 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40040 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40046 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40062 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40042 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40036 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40054 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40050 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40064 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40230 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40232 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40234 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40238 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40240 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] INFO: 172.21.0.13:40242 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[2025-07-29 14:08:48] 2025-07-29 14:08:47,846 INFO compiled_dag_node.py:2200 -- Waiting for worker tasks to exit
[2025-07-29 14:08:48] INFO: Shutting down
[2025-07-29 14:08:48] INFO: Waiting for application shutdown.
[2025-07-29 14:08:48] INFO: Application shutdown complete.
[2025-07-29 14:08:48] INFO: Finished server process [394]
[2025-07-29 14:08:48] *** SIGTERM received at time=1753769327 on cpu 52 ***
[2025-07-29 14:08:48] PC: @ 0x7fa2f5d247d1 (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
[2025-07-29 14:08:48] @ 0x7fa2f5d29420 (unknown) (unknown)
[2025-07-29 14:08:48] @ ... and at least 1 more frames
[2025-07-29 14:08:48] [2025-07-29 14:08:47,999 E 462 462] logging.cc:496: *** SIGTERM received at time=1753769327 on cpu 52 ***
[2025-07-29 14:08:48] [2025-07-29 14:08:47,999 E 462 462] logging.cc:496: PC: @ 0x7fa2f5d247d1 (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
[2025-07-29 14:08:48] [2025-07-29 14:08:47,999 E 462 462] logging.cc:496: @ 0x7fa2f5d29420 (unknown) (unknown)
[2025-07-29 14:08:48] [2025-07-29 14:08:47,999 E 462 462] logging.cc:496: @ ... and at least 1 more frames
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2173 -- Tearing down compiled DAG
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 1d4455cb1faf1b71a30cf38801000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, bb9f1744bf2f1200cafb73f101000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 5a33ec2b91235cb33da4822f01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 8c2f6bbe3b450dff3d6cf6dd01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ffbd3bfe290a79346e2465d201000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, c277541c672accdd2a6a195201000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 878e8beadc90a046175f39df01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,031 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 9e69485269c2f30724737d0601000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,039 INFO compiled_dag_node.py:2200 -- Waiting for worker tasks to exit
[2025-07-29 14:08:49] 2025-07-29 14:08:48,039 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, 1d4455cb1faf1b71a30cf38801000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, bb9f1744bf2f1200cafb73f101000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, 5a33ec2b91235cb33da4822f01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, 8c2f6bbe3b450dff3d6cf6dd01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, ffbd3bfe290a79346e2465d201000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, c277541c672accdd2a6a195201000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, 878e8beadc90a046175f39df01000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2161 -- Killing actor: Actor(RayWorkerWrapper, 9e69485269c2f30724737d0601000000)
[2025-07-29 14:08:49] 2025-07-29 14:08:48,040 INFO compiled_dag_node.py:2203 -- Teardown complete
[2025-07-29 14:08:53] end training ...

Your logs confirm that the vLLM Ray backend is hitting a RayChannelTimeoutError after a successful first batch, then the engine and all Ray workers are killed, resulting in HTTP 500 errors for all subsequent requests. The error message specifically says: “If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging” (vLLM Issue #15102).

This means your distributed batch is taking longer than Ray’s default timeout, causing Ray to kill the workers and vLLM to crash. This is a known issue for large models and large batches with the Ray backend (vLLM Issue #15102, vLLM Issue #17881). The recommended fix is to set the environment variable RAY_CGRAPH_get_timeout (and optionally RAY_CGRAPH_submit_timeout) to a much higher value (e.g., 120 or 300 seconds) before launching vLLM. Example:

export RAY_CGRAPH_get_timeout=300
export RAY_CGRAPH_submit_timeout=300
vllm serve ... # your usual command

Would you like more detail on how to set these variables in Docker or Kubernetes, or on other stability tips for large distributed vLLM jobs?


I tried replacing Ray with mp.
It successfully increased the safe batch size from 8 (Ray) to 64 (mp).
However, once I increased it further, my client hit errors again and the server logged:

`INFO: 127.0.0.1:40886 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error`

Then the service crashed.

Switching from Ray to the mp (multiprocessing) backend often increases stability and allows larger batch sizes, but you are still hitting a server crash and HTTP 500 errors when the batch size is too large. This is most likely due to out-of-memory (OOM) on the GPU or CPU, or exceeding other resource limits (e.g., /dev/shm size, file descriptors), not a bug in vLLM itself (Troubleshooting Docs, Conserving Memory).

To resolve this, reduce max_num_seqs, max_num_batched_tokens, or batch size until the server is stable. You can also try lowering gpu_memory_utilization (e.g., to 0.9 or 0.8), and ensure /dev/shm is large enough (use --ipc=host in Docker). If you want to push the limits, monitor GPU/CPU memory usage and logs for OOM or resource errors (Troubleshooting Docs).
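
As a starting point, a hedged example of a more conservative launch with the mp backend (the values are illustrative; raise them step by step while watching memory):

```bash
vllm serve /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/ \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --dtype half \
    --trust-remote-code \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.90

# In Docker, also give the workers enough shared memory:
# docker run --gpus all --ipc=host ...
```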

Would you like a step-by-step guide for tuning these parameters or for monitoring resource usage?

On the same hardware, how much performance difference can be expected between CLI serve deployment and offline pipeline deployment?

On the same hardware, vLLM's CLI serve (API server) deployment is usually slower than offline pipeline inference (calling the LLM class directly), mainly because the API server adds inter-process communication and HTTP request overhead. According to the official issues, offline LLM-class inference can reach 2-3x the throughput of the API server, and the gap widens further for large-batch inference.

For single requests or small batches the gap narrows, but offline mode remains faster. The API server suits online serving; offline mode suits efficient batch inference. Related discussions likewise attribute the API server's slowdown mainly to the extra networking and serialization overhead.
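
For reference, a minimal offline-batch sketch (the model path, prompts, and batch size are placeholders); this is the mode those throughput comparisons refer to:

```python
from vllm import LLM, SamplingParams

# One LLM.generate() call over the whole list lets vLLM batch internally,
# with no HTTP, serialization, or connection-pool overhead.
llm = LLM(model="/path/to/model", tensor_parallel_size=8, trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=200)

outputs = llm.generate([f"Question {i}: ..." for i in range(64)], params)
for out in outputs:
    print(out.outputs[0].text)
```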

Would you like a more detailed performance comparison or parameter-tuning advice?

Will the two differ in accuracy?
With the same prompts and essentially the same parameters,
my test results differ a lot between them. Why?

In theory the two (CLI serve deployment vs. offline pipeline deployment) should show no meaningful accuracy difference as long as the model weights and inference parameters are identical. In practice, API serving and offline inference can produce slightly different results, mainly because of floating-point error and non-determinism introduced by numerical precision, batching, and concurrent scheduling. Under low precision such as float16/bfloat16 these differences get amplified, so generations and even logprobs can diverge noticeably. The official FAQ explicitly states that vLLM does not guarantee perfectly stable outputs or logprobs.

The same phenomenon also shows up between vLLM and transformers implementations: tiny numerical inconsistencies introduced by the underlying kernels, batching, and hardware accumulate step by step and eventually change the generated content. Both the related discussions and the official notes explain this in detail.

Would you like a more detailed explanation of the mechanism, or advice on reproducing it?

So which is actually more accurate, CLI serve or the offline pipeline?

Neither is inherently "more accurate": with identical model weights and inference parameters, CLI serve and the offline pipeline should have the same theoretical accuracy. In practice, offline pipeline inference usually comes closer to the "ideal" result because it skips the networking, inter-process communication, and batch-scheduling layers, so there is less numerical noise. As the official FAQ notes, the API serve mode is more prone to floating-point differences introduced by concurrency and batching, so its results occasionally vary.

If you want maximum reproducibility and consistency, prefer offline pipeline inference. Neither mode guarantees 100% numerical identity, though; the differences come from implementation details and the hardware environment, not from one mode being "more accurate". Would you like a more detailed explanation or reproduction advice?
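
For the reproducibility side, a small sketch of pinning the request-level sampling parameters (greedy decoding plus a fixed per-request seed); numerical differences from batching and parallelism can still remain:

```python
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.0,   # greedy decoding
    top_p=1.0,
    seed=1234,         # fixed per-request seed
    max_tokens=200,
)
```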

Please take a look at how my two scripts run: the accuracy is far below both CLI serve and transformers inference. Help me find the cause.

Pipeline script:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Stage-3 Valid – InternVL-78B 本地 vLLM 批推理版本
=================================================
相比原来的 client/HTTP 版本,本脚本直接在本地通过 vLLM 引擎调用 `LLM.generate()`,
可显著提升推理吞吐。

核心流程保持不变:
1. 逐条构建提示词 (宽松评估标准)
2. 批量调用 vLLM 生成判断
3. 解析判断 → 置信度 / 正确与否
4. 保存 valid / invalid / report

仅保留最关键功能,去掉远程 API / checkpoint,保持脚本精简易跑。
"""

from __future__ import annotations

import os
import sys
import argparse
import logging
import json
import time
from dataclasses import dataclass
from typing import List, Dict, Tuple

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams, EngineArgs  # type: ignore

# ---------------------------------------------------------------------------
# 共用结构 – 直接在本文件中重新实现所需工具
# ---------------------------------------------------------------------------

import re


@dataclass
class ValidationResult:
    """单条验证输出"""

    judgment: str
    confidence: float
    is_correct: bool
    raw_output: str
    reasoning: str


class JudgmentParser:
    """解析模型自然语言输出,得到宽松判定"""

    correct_indicators = {
        "正确",
        "基本正确",
        "大致正确",
        "部分正确",
        "大部分正确",
        "总体正确",
        "符合要求",
        "accurate",
        "mostly correct",
        "partially correct",
        "true",
        "yes",
    }

    @staticmethod
    def parse_judgment(resp: str) -> Tuple[bool, float, str, str]:
        if not resp:
            return False, 0.0, "无响应", ""

        clean = JudgmentParser._clean(resp)
        judgment = JudgmentParser._extract_judgment(clean)
        confidence = JudgmentParser._extract_confidence(clean)
        reasoning = JudgmentParser._extract_reasoning(clean)
        is_correct = JudgmentParser._is_correct(judgment)
        return is_correct, confidence, judgment, reasoning

    @staticmethod
    def _clean(text: str) -> str:
        text = re.sub(r"<\|.*?\|>", "", text)
        return re.sub(r"\s+", " ", text).strip()

    @staticmethod
    def _extract_judgment(text: str) -> str:
        m = re.search(r"判断[::]\s*(.+?)(?:\n|置信度|理由|$)", text, re.I | re.S)
        return m.group(1).strip() if m else text

    @staticmethod
    def _extract_confidence(text: str) -> float:
        m = re.search(r"置信度[::]\s*(\d*\.?\d+)", text)
        if m:
            try:
                return min(float(m.group(1)), 1.0)
            except Exception:
                pass
        # 粗略估计
        lower = text.lower()
        if any(k in lower for k in ("完全正确", "absolutely", "显然")):
            return 0.95
        if "部分" in text or "partially" in lower:
            return 0.6
        return 0.7

    @staticmethod
    def _extract_reasoning(text: str) -> str:
        m = re.search(r"理由[::]\s*(.+)", text, re.I | re.S)
        return m.group(1).strip() if m else "无详细理由"

    @staticmethod
    def _is_correct(judgment: str) -> bool:
        j = judgment.lower()
        if any(ind in j for ind in JudgmentParser.correct_indicators):
            return not any(neg in j for neg in ("不正确", "错误", "incorrect"))
        return False


# ---------------- 数据集 I/O ----------------


def load_dataset(path: str) -> List[Dict]:
    logger.info(f"加载数据集: {path}")
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except Exception as e:
        logger.error(f"读取失败: {e}")
        sys.exit(1)

    if not isinstance(data, list):
        logger.error("顶层应为列表")
        sys.exit(1)

    processed: List[Dict] = []
    for item in data:
        if not isinstance(item, dict):
            continue

        # 兼容字段
        image = item.get("image") or item.get("image_path")
        q = item.get("question") or item.get("problem")
        a = item.get("answer") or item.get("gt_answer")

        if not (image and q and a):
            continue

        processed.append({"image": image, "question": q, "answer": a, **item})

    logger.info(f"标准化后样本数: {len(processed)}/{len(data)}")
    return processed


def save_json(path: str, data: List[Dict]):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    logger.info(f"保存 {len(data)} 条 → {path}")


def clean_validation_data(data: List[Dict]) -> List[Dict]:
    return [{k: v for k, v in item.items() if k != "gt_valid"} for item in data]


def generate_report(correct: List[Dict], incorrect: List[Dict], out_dir: str, fname: str, start_time: float, api: str):
    total = len(correct) + len(incorrect)
    duration = time.time() - start_time
    report = (
        f"Stage-3 Valid 报告\n====================\n"
        f"文件: {fname}\n总样本: {total}\n"
        f"正确: {len(correct)}  ({len(correct)/total*100:.2f}%)\n"
        f"错误: {len(incorrect)} ({len(incorrect)/total*100:.2f}%)\n"
        f"耗时: {duration:.2f}s\n"
        f"模式: 本地 vLLM pipeline\n"
    )
    path = os.path.join(out_dir, f"{fname}_report.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(report)
    logger.info(f"报告已保存 → {path}")


logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)

# ---------------------------------------------------------------------------
# 配置
# ---------------------------------------------------------------------------
@dataclass
class ValidationConfig:
    model_path: str = "/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B"
    batch_size: int = 128
    max_new_tokens: int = 800
    temperature: float = 0.0
    top_p: float = 1.0
    gpu_util: float = 0.85  # 依据显存情况微调

# ---------------------------------------------------------------------------
# 本地 vLLM 推理器
# ---------------------------------------------------------------------------
class InternVLPipelineInference:
    """本地 vLLM 批推理封装。"""

    def __init__(self, cfg: ValidationConfig):
        self.cfg = cfg

        logger.info("初始化 tokenizer …")
        self.tokenizer = AutoTokenizer.from_pretrained(cfg.model_path, trust_remote_code=True)

        logger.info("初始化 vLLM 引擎 … (耗时 1-3 分钟)")
        eng_args = EngineArgs(
            model=cfg.model_path,
            tensor_parallel_size=max(1, int(os.environ.get("WORLD_SIZE", 1))),
            max_model_len=4096,
            gpu_memory_utilization=cfg.gpu_util,
            trust_remote_code=True,
        )
        self.llm = LLM(**eng_args.__dict__)

        stop_tokens = ["<|endoftext|>", "<|im_end|>"]
        stop_token_ids = [self.tokenizer.convert_tokens_to_ids(tok) for tok in stop_tokens]
        self.sp = SamplingParams(
            temperature=cfg.temperature,
            top_p=cfg.top_p,
            max_tokens=cfg.max_new_tokens,
            stop_token_ids=stop_token_ids,
            skip_special_tokens=True,
        )

    # -------------------- prompt 构建 --------------------
    def _build_prompt(self, sample: Dict) -> str:
        question = sample.get("question", "")
        answer = sample.get("answer", "")
        image_path = sample.get("image", "")

        sys_prompt = (
            "你是一个严谨的、注重事实的多模态问答评估员。请基于图片内容,判断\"待判断答案\"是否准确地回答了\"问题\"。\n\n"
            "**要求:**\n"
            "1. **事实为先**: 你的判断必须严格基于图片内容,不能有任何想象或推断。\n"
            "2. **宽松标准**: '完全正确'、'基本正确'、'大致正确'、'部分正确'都视为【正确】。只有'完全错误'、'严重错误'、'明显不符'才视为【错误】。\n"
            "3. **输出格式必须如下**:\n\n"
            "判断:[正确/基本正确/部分正确/错误]\n"
            "置信度:[0.0-1.0]\n"
            "理由:[简要说明判断依据,指出答案中正确或错误的关键点]\n\n"
        )

        messages = [
            {
                "role": "user",
                "content": f"<image>\n{sys_prompt}\n<image>\n--- 以下是待评估内容 ---\n问题:{question}\n待判断答案:{answer}"
            }
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        # 插入图片路径
        prompt = prompt.replace("<image>", f"<image>{image_path}</image>")
        return prompt

    # -------------------- 核心批推理 --------------------
    def validate_batch(self, batch_data: List[Dict]) -> List[ValidationResult]:
        prompts = [self._build_prompt(s) for s in batch_data]
        outputs = self.llm.generate(prompts, self.sp)

        results: List[ValidationResult] = []
        for out in outputs:
            raw = out.outputs[0].text.strip()
            is_corr, conf, judgment, reasoning = JudgmentParser.parse_judgment(raw)
            results.append(
                ValidationResult(
                    judgment=judgment,
                    confidence=conf,
                    is_correct=is_corr,
                    raw_output=raw,
                    reasoning=reasoning,
                )
            )
        return results

# ---------------------------------------------------------------------------
# 入口逻辑 (简化,无 checkpoint)
# ---------------------------------------------------------------------------

def parse_args():
    p = argparse.ArgumentParser("Stage-3 Valid – 本地 vLLM Pipeline 版")
    p.add_argument("--input", required=True, help="输入 JSON 文件")
    p.add_argument(
        "--output",
        default="/mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage1_cls/output_stage1_pipeline",
        help="输出根目录 (默认: /mnt/afs/smartbrain/zhukangkang/code/common_data_pipeline/stage1_cls/output_stage1_pipeline)",
    )
    p.add_argument("--model", default="/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B", help="模型路径")
    p.add_argument("--batch-size", type=int, default=128, help="批推理大小")
    p.add_argument("--gpu-util", type=float, default=0.85, help="GPU 显存利用率上限")
    p.add_argument("--debug", action="store_true")
    return p.parse_args()


def main():
    args = parse_args()
    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    cfg = ValidationConfig(
        model_path=args.model,
        batch_size=args.batch_size,
        gpu_util=args.gpu_util,
    )

    inference = InternVLPipelineInference(cfg)

    data = load_dataset(args.input)
    if not data:
        logger.error("数据为空,退出")
        sys.exit(1)

    # 输出目录遵循旧脚本逻辑: <output_root>/<input_filename>/
    input_fname = os.path.splitext(os.path.basename(args.input))[0]
    run_out_dir = os.path.join(args.output, input_fname)
    os.makedirs(run_out_dir, exist_ok=True)

    correct: List[Dict] = []
    incorrect: List[Dict] = []

    total = len(data)
    logger.info(f"共有 {total} 条样本,batch_size={cfg.batch_size}")

    for i in range(0, total, cfg.batch_size):
        batch = data[i : i + cfg.batch_size]
        logger.info(f"处理 {i+1}‒{min(i+cfg.batch_size, total)} …")
        results = inference.validate_batch(batch)
        for sample, res in zip(batch, results):
            sample["gt_valid"] = {
                "validation": {
                    "model": "InternVL-78B-pipeline",
                    "judgment": res.judgment,
                    "confidence": res.confidence,
                    "reasoning": res.reasoning,
                    "is_correct": res.is_correct,
                    "raw_output": res.raw_output,
                }
            }
            (correct if res.is_correct else incorrect).append(sample)

    # 保存,与老脚本命名保持一致
    valid_path = os.path.join(run_out_dir, f"{input_fname}_valid.json")
    invalid_path = os.path.join(run_out_dir, f"{input_fname}_invalid.json")

    save_json(valid_path, clean_validation_data(correct))
    save_json(invalid_path, incorrect)

    # 报告
    generate_report(correct, incorrect, run_out_dir, input_fname, time.time(), "local-pipeline")

    logger.info("✅ 全部完成!")


if __name__ == "__main__":
    main()
```

Your pipeline script has no obvious "low accuracy" bug by itself, but results that differ sharply from CLI serve / transformers inference usually come down to a few root causes:

  1. max_model_len set too small: the pipeline uses `max_model_len=4096`, while CLI serve / transformers likely run with 32768 or more. For long inputs, truncation means the model never sees the full context, which badly hurts accuracy (see the related documentation; a sketch follows this list).

  2. Inconsistent tokenizer configuration: the pipeline uses `AutoTokenizer.from_pretrained(cfg.model_path, trust_remote_code=True)`, but if CLI serve / transformers load a different tokenizer setup (custom chat_template, extra special tokens, etc.), tokenization and special-symbol handling will differ, changing both the prompt and the output (related discussion).

  3. Different prompt construction: the pipeline uses `apply_chat_template` plus manual insertion, while CLI serve / transformers may use a different prompt template or API, so the actual model input differs.

  4. Inconsistent inference parameters: any mismatch in `temperature`, `top_p`, `max_tokens`, `stop_token_ids`, etc. can change the generated content substantially.

  5. Multi-GPU / concurrency / batching differences: vLLM's batching, concurrent scheduling, and float16/bfloat16 precision all introduce non-determinism, so results can differ from single-GPU transformers inference (official FAQ).
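
For point 1, a sketch of how the EngineArgs block in your pipeline script could be aligned with the serving command; the values are assumptions to adjust to your server flags and memory budget:

```python
eng_args = EngineArgs(
    model=cfg.model_path,
    tensor_parallel_size=8,          # match the server's --tensor-parallel-size
    max_model_len=32768,             # match the server's --max-model-len instead of 4096
    max_num_batched_tokens=65536,    # optionally mirror --max-num-batched-tokens
    max_num_seqs=128,                # optionally mirror --max-num-seqs
    gpu_memory_utilization=cfg.gpu_util,
    trust_remote_code=True,
)
```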

Suggested order of checks:

  • Confirm that max_model_len, the tokenizer, the prompt template, and the inference parameters are identical between the pipeline and CLI serve / transformers.
  • Check the input data for truncation or lost special symbols.
  • If you run multi-GPU, try reproducing on a single GPU and/or in float32 to rule out concurrency and precision effects.
  • Compare the actual input prompts and the tokenizer output on both sides and make sure they match exactly (see the sketch below).
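
A hedged debugging sketch for the last point: render the prompt both ways for the same sample and compare the token IDs. The `messages` and manually built `manual` prompt below are placeholders standing in for what the two scripts actually send:

```python
from transformers import AutoTokenizer

model_path = "/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B"
tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# What the pipeline script builds via the chat template
messages = [{"role": "user", "content": "<image>\n问题:...\n待判断答案:..."}]
templated = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# What the CLI client concatenates by hand
manual = "USER: ...<image>...\nASSISTANT:"

ids_a, ids_b = tok.encode(templated), tok.encode(manual)
print(len(ids_a), len(ids_b), ids_a == ids_b)
```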

Let me know if you need more detailed debugging steps or a code-comparison method.

My CLI script

Client:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Stage-3 Valid InternVL-78B 验证脚本(vLLM批推理版本)
使用vLLM的原生批推理格式进行高效验证
-----------------------------------------------------
功能:
1. 一步验证:对所有数据进行宽松标准验证(部分正确也算正确)
2. vLLM批推理:使用vLLM原生格式进行真正的批量推理
3. 资源隔离:纯CPU运行,不占用本地GPU资源
4. 高效处理:真正的批量处理,提升推理效率

验证流程:
- 宽松验证:使用宽松标准判断答案正确性,只有明显错误才判定为失败

输出结构:
- output_stage3_valid/
  - {input_filename}/                    # 以输入文件名命名的文件夹
    ├── {input_filename}_valid.json     # 最终验证通过的数据
    ├── {input_filename}_invalid.json   # 最终验证失败的数据  
    ├── {input_filename}_report.txt     # 详细报告
    └── checkpoints/                    # 检查点文件夹

使用示例:
# 远程vLLM服务需要手动指定地址和端口

# 方式1:使用完整API URL(推荐)
python stage3_valid.py --input /path/to/input.json --api-url http://172.21.0.36:6666

# 方式2:分别指定主机和端口
python stage3_valid.py --input /path/to/input.json --host 172.21.0.36 --port 6666

# 查看现有检查点状态:
python stage3_valid.py --input /path/to/input.json --api-url http://172.21.0.36:6666 --list-checkpoints

# 清理检查点重新开始:
python stage3_valid.py --input /path/to/input.json --api-url http://172.21.0.36:6666 --clean-checkpoints

# 禁用断点恢复,强制从头开始(不删除检查点):
python stage3_valid.py --input /path/to/input.json --api-url http://172.21.0.36:6666 --no-resume

# 测试模式(先运行这个验证环境):
python stage3_valid.py --input /path/to/input.json --host 172.21.0.36 --port 6666 --test-mode --batch-size 4 --debug

# 完整运行(推荐参数):
python stage3_valid.py --input /path/to/input.json --api-url http://172.21.0.36:6666 --batch-size 16 --checkpoint-interval 50

检查点功能说明:
- 自动创建:每处理指定数量样本或每10个批次自动创建检查点
- 断点恢复:程序中断后重新运行,自动从最新检查点继续
- 检查点类型:
  * validation_progress: 按样本数量间隔创建的进度检查点
  * validation_batch: 按批次间隔创建的批次检查点  
  * validation_interrupted: 用户中断时创建的紧急检查点
  * final: 验证完成时的最终检查点
- 自动清理:保留最新3个检查点,自动删除旧的检查点文件
- 无缝恢复:从检查点恢复的结果与一次性运行完全相同
"""

import os
import sys
from typing import Tuple, Optional, List, Dict

# 环境变量设置 - 强制CPU运行,避免占用GPU资源
os.environ['RAY_DISABLE_IMPORT_WARNING'] = '1'
os.environ['RAY_ADDRESS'] = 'local'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # 强制不使用GPU
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:0'  # 避免CUDA内存分配
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # 避免CUDA相关操作

def _get_display_gpus() -> str:
    """vLLM批推理模式下仅用于显示的GPU信息"""
    return "CPU模式(使用远程vLLM)"

def set_cuda_devices_early():
    """解析命令行参数(vLLM批推理模式下主要用于兼容性)"""
    import argparse
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--gpus', type=str, default="CPU模式", help='使用的GPU设备(CPU模式下仅用于显示)')
    parser.add_argument('--help', action='store_true', help='显示帮助信息')
    args, _ = parser.parse_known_args()

    if args.help:
        return "CPU模式"

    gpus = args.gpus
    
    # CPU模式下不需要GPU设置
    if not gpus or gpus.lower() == 'auto':
        gpus = "CPU模式(使用远程vLLM)"

    print(f"[运行模式] 纯CPU模式:{gpus}")
    return gpus

# 在导入其他库之前设置GPU(vLLM批推理模式下主要用于显示)
SELECTED_GPUS = set_cuda_devices_early()

import json
import logging
import time
import argparse
import re
import signal
import requests
import concurrent.futures
import threading
from dataclasses import dataclass
from tqdm import tqdm
from PIL import Image
import io
import base64
from requests.adapters import HTTPAdapter

# ---------------------------------------------------------------------------
# 日志配置
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# 数据结构定义
# ---------------------------------------------------------------------------
@dataclass
class ValidationResult:
    """验证结果"""
    judgment: str
    confidence: float
    is_correct: bool
    raw_output: str
    reasoning: str

# ---------------------------------------------------------------------------
# 配置类
# ---------------------------------------------------------------------------
class ValidationConfig:
    """验证配置(vLLM批推理版本)"""
    # API配置
    API_URL = None  # vLLM API服务地址(通过命令行参数传入)
    API_TIMEOUT = 180   # API请求超时时间(秒,批推理需要更长时间)
    
    # 批推理配置
    BATCH_SIZE = 64  # 批次大小(vLLM批推理的真正批次);与远程 vLLM 的 --max-num-seqs 对齐
    MAX_NEW_TOKENS = 800  # 最大输出长度
    TEMPERATURE = 0.0  # 温度参数
    TOP_P = 1.0  # top_p参数
    DO_SAMPLE = False  # 是否采样
    
    # 重试配置
    REQUEST_RETRY_TIMES = 3  # 请求重试次数
    RETRY_DELAY = 2.0  # 重试延迟(秒)

# ---------------------------------------------------------------------------
# 判断结果解析器
# ---------------------------------------------------------------------------
class JudgmentParser:
    """判断结果解析器"""
    
    @staticmethod
    def parse_judgment(response: str) -> Tuple[bool, float, str, str]:
        """
        解析模型输出的判断结果
        返回: (是否正确, 置信度, 判断文本, 推理过程)
        """
        if not response:
            return False, 0.0, "无响应", ""
        
        # 清理输出
        cleaned_response = JudgmentParser._clean_response(response)
        
        # 提取信息
        judgment_text = JudgmentParser._extract_judgment(cleaned_response)
        confidence = JudgmentParser._extract_confidence(cleaned_response)
        reasoning = JudgmentParser._extract_reasoning(cleaned_response)
        is_correct = JudgmentParser._determine_correctness(judgment_text)
        
        return is_correct, confidence, judgment_text, reasoning
    
    @staticmethod
    def _clean_response(response: str) -> str:
        """清理响应文本"""
        response = re.sub(r'<\|.*?\|>', '', response)
        response = re.sub(r'\s+', ' ', response).strip()
        return response
    
    @staticmethod
    def _extract_judgment(response: str) -> str:
        """提取判断文本"""
        # 尝试匹配结构化输出
        judgment_match = re.search(r'判断[::]\s*(.+?)(?:\n|置信度|理由|$)', response, re.IGNORECASE | re.DOTALL)
        if judgment_match:
            return judgment_match.group(1).strip()
        
        return response
    

    @staticmethod
    def _extract_confidence(response: str) -> float:
        """提取置信度"""
        conf_match = re.search(r'置信度[::]\s*(\d*\.?\d+)', response, re.IGNORECASE)
        if conf_match:
            try:
                confidence = float(conf_match.group(1))
                return min(confidence, 1.0)
            except (ValueError, IndexError):
                pass
        
        # 如果没有明确置信度,根据判断词汇估算
        response_lower = response.lower()
        if any(word in response_lower for word in ['确定', '显然', '完全正确', '完全错误']):
            return 0.95
        elif any(word in response_lower for word in ['基本正确', '核心意图成功']):
            return 0.8
        elif any(word in response_lower for word in ['部分正确', '可能', '似乎']):
            return 0.6
        return 0.7 # 默认值

    @staticmethod
    def _extract_reasoning(response: str) -> str:
        """提取理由"""
        # 匹配理由
        reason_match = re.search(r'理由[::]\s*(.+)', response, re.IGNORECASE | re.DOTALL)
        if reason_match:
            return reason_match.group(1).strip()
        
        return "无详细理由"

    @staticmethod
    def _determine_correctness(judgment_text: str) -> bool:
        """
        根据判断文本决定是否正确
        宽松标准:部分正确也算正确
        """
        judgment_lower = judgment_text.lower().strip()
        
        # 定义正确的指示词
        # 宽松标准:部分正确也算正确
        correct_indicators = ['正确', '基本正确', '大致正确', '主要正确', '部分正确', 
                              '大部分正确', '总体正确', '符合要求', 'correct', 
                              'mostly correct', 'partially correct', 'generally correct',
                              'true', 'yes', '对', '准确', '合理', '可以接受']
        
        if any(indicator in judgment_lower for indicator in correct_indicators):
            # 确保不是"不正确"等否定形式
            if not any(neg in judgment_lower for neg in ['不正确', 'incorrect', '错误']):
                return True
                
        return False

# ---------------------------------------------------------------------------
# 图像处理工具
# ---------------------------------------------------------------------------
def load_image_from_path(image_path: str) -> Image.Image:
    """从路径加载PIL图像"""
    if image_path.startswith('http'):
        # 网络图片,下载后加载
        response = requests.get(image_path, timeout=30)
        response.raise_for_status()
        image = Image.open(io.BytesIO(response.content))
    else:
        # 本地图片,直接加载
        image = Image.open(image_path)
    
    return image

# ---------------------------------------------------------------------------
# 模型推理器(vLLM批推理版本)
# ---------------------------------------------------------------------------
class InternVLInference:
    """InternVL vLLM批推理器"""
    
    def __init__(self, config: ValidationConfig, api_url: Optional[str] = None):
        self.config = config
        self.api_url = api_url or config.API_URL
        # 复用 HTTP keep-alive,降低连接开销
        self.session = requests.Session()
        # 动态设置连接池大小为 batch_size
        adapter = HTTPAdapter(pool_connections=self.config.BATCH_SIZE, pool_maxsize=self.config.BATCH_SIZE)
        self.session.mount('http://', adapter)
        self.session.mount('https://', adapter)
        
        self._test_api_connection()
    
    def _test_api_connection(self):
        """测试API连接"""
        logger.info(f"正在测试vLLM API连接: {self.api_url}")
        
        try:
            # 测试基本连接
            health_url = f"{self.api_url.rstrip('/')}/health"
            response = self.session.get(health_url, timeout=10)
            
            if response.status_code == 200:
                logger.info("✅ vLLM API健康检查通过")
            else:
                logger.warning(f"⚠️ API健康检查返回状态码: {response.status_code}")
        
        except requests.exceptions.RequestException as e:
            logger.error(f"❌ vLLM API连接测试失败: {e}")
            logger.error(f"请确保vLLM API服务正在运行于 {self.api_url}")
            logger.error("如果服务未启动,请使用类似命令启动:")
            
            # 从URL中提取端口号用于示例
            import re
            from urllib.parse import urlparse
            try:
                parsed_url = urlparse(self.api_url)
                port = parsed_url.port or 6666
                logger.error(f"python -m vllm.entrypoints.openai.api_server --model /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B --port {port} --tensor-parallel-size 8")
            except Exception:
                logger.error("python -m vllm.entrypoints.openai.api_server --model /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B --port <PORT> --tensor-parallel-size 8")
            
            sys.exit(1)
        
        # 测试模型信息
        try:
            models_url = f"{self.api_url.rstrip('/')}/v1/models"
            response = self.session.get(models_url, timeout=10)
            
            if response.status_code == 200:
                models_info = response.json()
                logger.info(f"可用模型: {models_info}")
            else:
                logger.warning(f"获取模型信息失败: {response.status_code}")
        
        except Exception as e:
            logger.warning(f"获取模型信息时出错: {e}")
        
        logger.info(f"✅ vLLM API连接配置完成: {self.api_url}")
    
    def validate_batch(self, batch_data: List[Dict]) -> List[ValidationResult]:
        """vLLM批量验证"""
        if not batch_data:
            return []
        
        batch_id = batch_data[0].get('id', 'unknown')
        logger.info(f"[vLLM-BATCH] 开始处理批次,大小={len(batch_data)},首个样本ID={batch_id}")
        
        logger.info(f"[vLLM-BATCH] 开始准备vLLM批推理数据...")
        batch_requests = []
        valid_samples = 0
        invalid_samples = 0
        sample_ids = []
        
        for i, sample in enumerate(batch_data):
            sample_id = sample.get("id", f"sample_{i}")
            image_path = sample.get("image", "")
            
            # 预检查图像路径
            if not image_path:
                logger.warning(f"[vLLM-BATCH] 样本{sample_id}: 缺少图像路径")
                batch_requests.append(None)
                sample_ids.append(sample_id)
                invalid_samples += 1
                continue
            
            # 检查文件是否存在(对于本地文件)
            if not image_path.startswith('http') and not os.path.exists(image_path):
                logger.warning(f"[vLLM-BATCH] 样本{sample_id}: 图像路径无效 {image_path}")
                batch_requests.append(None)
                sample_ids.append(sample_id)
                invalid_samples += 1
                continue

            try:
                logger.debug(f"[vLLM-BATCH] 处理样本{sample_id}: 构建提示词...")
                # 构建vLLM格式的提示词
                prompt = self._build_vllm_prompt(sample)
                
                logger.debug(f"[vLLM-BATCH] 样本{sample_id}: 加载图像 {image_path[:50]}...")
                # 加载PIL图像
                image = load_image_from_path(image_path)
                logger.debug(f"[vLLM-BATCH] 样本{sample_id}: 图像加载完成")
                
                # 准备vLLM批推理格式的请求数据
                request_data = {
                    "prompt": prompt,
                    "multi_modal_data": {"image": image},
                }
                
                batch_requests.append(request_data)
                sample_ids.append(sample_id)
                valid_samples += 1
                
            except Exception as e:
                logger.error(f"[vLLM-BATCH] 样本{sample_id}: 数据准备失败 - {e}")
                batch_requests.append(None)
                sample_ids.append(sample_id)
                invalid_samples += 1
                continue
        
        logger.info(f"[vLLM-BATCH] 数据准备完成: 有效={valid_samples}, 无效={invalid_samples}")
        
        # vLLM批量推理调用
        try:
            logger.info(f"[vLLM-INFERENCE] 开始vLLM批推理,有效样本数={valid_samples}")
            
            # 只对有效样本进行批推理
            valid_requests = [req for req in batch_requests if req is not None]
            
            if valid_requests:
                logger.info(f"[vLLM-INFERENCE] 准备发送vLLM批推理请求,数量={len(valid_requests)}")
                responses = self._vllm_batch_inference(valid_requests)
            else:
                logger.warning(f"[vLLM-INFERENCE] 没有有效样本,跳过批推理")
                responses = []
                
        except Exception as e:
            logger.error(f"[vLLM-INFERENCE] vLLM批推理失败: {e}", exc_info=True)
            return [self._create_error_result(f"vLLM批推理失败: {e}") for _ in batch_data]

        # 解析结果
        logger.info(f"[vLLM-PARSE] 开始解析vLLM批推理结果,响应数={len(responses) if responses else 0}")
        results = []
        response_idx = 0
        
        for i, (request_data, sample_id) in enumerate(zip(batch_requests, sample_ids)):
            if request_data is None:  # 处理之前标记的无效样本
                logger.debug(f"[vLLM-PARSE] 样本{sample_id}: 跳过无效样本")
                results.append(self._create_error_result(f"数据准备失败"))
                continue

            try:
                logger.debug(f"[vLLM-PARSE] 样本{sample_id}: 开始解析响应")
                
                # 获取对应的响应
                if response_idx >= len(responses):
                    logger.error(f"[vLLM-PARSE] 样本{sample_id}: 响应索引{response_idx}超出范围{len(responses)}")
                    results.append(self._create_error_result("响应索引超出范围"))
                    continue
                    
                raw_output = responses[response_idx]
                response_idx += 1
                
                logger.debug(f"[vLLM-PARSE] 样本{sample_id}: 响应长度={len(raw_output)}字符")
                
                is_correct, confidence, judgment, reasoning = JudgmentParser.parse_judgment(raw_output)
                
                logger.debug(f"[vLLM-PARSE] 样本{sample_id}: 判断结果={judgment}, 置信度={confidence}, 正确性={is_correct}")
                
                result = ValidationResult(
                    judgment=judgment,
                    confidence=confidence,
                    is_correct=is_correct,
                    raw_output=raw_output,
                    reasoning=reasoning
                )
                results.append(result)
            
            except Exception as e:
                logger.error(f"[vLLM-PARSE] 解析样本{sample_id}的响应时出错: {e}", exc_info=True)
                results.append(self._create_error_result(f"响应解析失败: {e}"))
        
        logger.info(f"[vLLM-PARSE] 结果解析完成: 总结果={len(results)}")
        success_count = sum(1 for r in results if r.is_correct)
        logger.info(f"[vLLM-PARSE] 批次统计: 成功={success_count}/{len(responses)}")
        
        return results
    
    def _vllm_batch_inference(self, valid_requests: List[Dict]) -> List[str]:
        """vLLM批推理调用 - 优化版本,支持真正的批量请求"""
        if not valid_requests:
            return []
        
        api_url = f"{self.api_url.rstrip('/')}/v1/chat/completions"
        
        logger.info(f"[vLLM-API] 使用vLLM批推理处理 {len(valid_requests)} 个请求")
        
        # 根据vLLM服务的max_num_seqs参数调整批次大小
        # 您的启动参数中设置了 --max-num-seqs 32
        max_batch_size = min(32, len(valid_requests))  # 使用服务端配置的最大序列数
        
        all_responses = []
        
        # 分批处理,每批最多32个请求(根据您的vLLM服务配置)
        for batch_start in range(0, len(valid_requests), max_batch_size):
            batch_end = min(batch_start + max_batch_size, len(valid_requests))
            current_batch = valid_requests[batch_start:batch_end]
            
            logger.info(f"[vLLM-API] 处理批次 {batch_start//max_batch_size + 1},大小: {len(current_batch)}")
            
            batch_responses = self._process_single_batch(current_batch, api_url)
            all_responses.extend(batch_responses)
        
        return all_responses
    
    def _process_single_batch(self, batch_requests: List[Dict], api_url: str) -> List[str]:
        """处理单个批次的请求"""
        for retry_count in range(self.config.REQUEST_RETRY_TIMES):
            try:
                logger.debug(f"[vLLM-API] 批次尝试 {retry_count + 1}/{self.config.REQUEST_RETRY_TIMES}")
                
                # 准备批量请求 - 使用线程池并发处理
                
                responses = [None] * len(batch_requests)
                responses_lock = threading.Lock()
                
                def single_request(index, request_data):
                    try:
                        prompt = request_data["prompt"]
                        image = request_data["multi_modal_data"]["image"]
                        
                        # 将PIL图像转换为base64
                        buffered = io.BytesIO()
                        
                        # 处理RGBA图像,转换为RGB
                        if image.mode == 'RGBA':
                            # 创建白色背景
                            background = Image.new('RGB', image.size, (255, 255, 255))
                            background.paste(image, mask=image.split()[-1])  # 使用alpha通道作为mask
                            image = background
                        elif image.mode not in ['RGB', 'L']:
                            # 确保图像是RGB或灰度模式
                            image = image.convert('RGB')
                        
                        image.save(buffered, format="JPEG")
                        img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
                        
                        # 单个请求的payload
                        single_payload = {
                            "model": "Qwen2.5-VL-72B-Instruct",
                            "messages": [
                                {
                                    "role": "user",
                                    "content": [
                                        {"type": "text", "text": prompt},
                                        {
                                            "type": "image_url",
                                            "image_url": {
                                                "url": f"data:image/jpeg;base64,{img_base64}"
                                            }
                                        }
                                    ]
                                }
                            ],
                            "max_tokens": self.config.MAX_NEW_TOKENS,
                            "temperature": self.config.TEMPERATURE,
                            "top_p": self.config.TOP_P,
                            "stream": False
                        }
                        
                        start_time = time.time()
                        response = self.session.post(
                            api_url,
                            json=single_payload,
                            timeout=self.config.API_TIMEOUT,
                            headers={"Content-Type": "application/json"}
                        )
                        
                        api_time = time.time() - start_time
                        
                        if response.status_code == 200:
                            result = response.json()
                            content = result['choices'][0]['message']['content']
                            with responses_lock:
                                responses[index] = content
                            logger.debug(f"[vLLM-API] 请求 {index+1} 成功,耗时: {api_time:.2f}秒")
                        else:
                            error_msg = f"vLLM API请求失败,状态码: {response.status_code}"
                            with responses_lock:
                                responses[index] = f"API错误: {error_msg}"
                            logger.warning(f"[vLLM-API] 请求 {index+1} 失败: {error_msg}")
                    
                    except Exception as e:
                        error_msg = f"请求异常: {e}"
                        with responses_lock:
                            responses[index] = error_msg
                        logger.error(f"[vLLM-API] 请求 {index+1} 异常: {error_msg}")
                
                # 使用线程池并发发送请求(利用vLLM的并发处理能力)
                max_workers = min(len(batch_requests), 128)  # 提高并发度:最多64个线程,与批次大小保持一致,但不超过64
                with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                    futures = [
                        executor.submit(single_request, i, req) 
                        for i, req in enumerate(batch_requests)
                    ]
                    
                    # 等待所有请求完成
                    concurrent.futures.wait(futures)
                
                # 检查是否所有请求都有响应
                if all(r is not None for r in responses):
                    success_count = len([r for r in responses if not r.startswith(('API错误:', '请求异常:'))])
                    logger.info(f"[vLLM-API] 批次完成,成功: {success_count}/{len(responses)}")
                    return responses
                else:
                    logger.warning(f"[vLLM-API] 批次中有请求未完成,重试...")
                    if retry_count < self.config.REQUEST_RETRY_TIMES - 1:
                        time.sleep(self.config.RETRY_DELAY * (retry_count + 1))
                        continue
                
            except Exception as e:
                error_msg = f"批次处理异常: {e}"
                logger.warning(f"[vLLM-API] {error_msg}")
                
                if retry_count < self.config.REQUEST_RETRY_TIMES - 1:
                    time.sleep(self.config.RETRY_DELAY * (retry_count + 1))
                    continue
        
        # 所有重试失败,返回错误响应
        return [f"系统错误: 批次处理失败" for _ in batch_requests]
    
    def _build_vllm_prompt(self, sample: Dict) -> str:
        """构建vLLM格式的验证提示词(优化prefix cache)"""
        question = sample.get("question", "")
        answer = sample.get("answer", "")
        
        # vLLM格式的提示词,将图片放在固定部分后面,问题前面
        # 这样固定的提示词部分可以被prefix cache缓存
        prompt = f"USER: 你是一个严谨的、注重事实的多模态问答评估员。请基于图片内容,判断\"待判断答案\"是否准确地回答了\"问题\"。\n\n"
        prompt += f"**要求:**\n"
        prompt += f"1. **事实为先**: 你的判断必须严格基于图片内容,不能有任何想象或推断。\n"
        prompt += f"2. **宽松标准**: '完全正确'、'基本正确'、'大致正确'、'部分正确'都视为【正确】。只有'完全错误'、'严重错误'、'明显不符'才视为【错误】。\n"
        prompt += f"3. **输出格式必须如下**:\n\n"
        prompt += f"判断:[正确/基本正确/部分正确/错误]\n"
        prompt += f"置信度:[0.0-1.0]\n"
        prompt += f"理由:[简要说明判断依据,指出答案中正确或错误的关键点]\n\n"
        prompt += f"<image>\n\n"  # 图片放在固定提示词后面
        prompt += f"--- 以下是待评估内容 ---\n"
        prompt += f"问题:{question}\n"
        prompt += f"待判断答案:{answer}\nASSISTANT:"
        
        return prompt
    
    def _create_error_result(self, error_msg: str) -> ValidationResult:
        """创建错误结果"""
        return ValidationResult(
            judgment=f"vLLM错误: {error_msg}",
            confidence=0.0,
            is_correct=False,
            raw_output=error_msg,
            reasoning=f"vLLM批推理过程中发生错误: {error_msg}"
        )

# ---------------------------------------------------------------------------
# 一步验证器(宽松标准)
# ---------------------------------------------------------------------------
class OneStepValidator:
    """一步验证器(宽松标准:部分正确也算正确)"""
    
    def __init__(self, config: ValidationConfig, api_url: str):
        self.config = config
        self.inference_engine = InternVLInference(config, api_url)
        self.checkpoint_interval = 100  # 每100条数据创建检查点
        self.checkpoint_dir = None  # 检查点目录,在run_validation中设置
        
    def _find_checkpoint_files(self, step_name: str) -> List[Tuple[int, str, str]]:
        """
        查找指定步骤的检查点文件
        返回: [(处理数量, 正确文件路径, 错误文件路径), ...],按处理数量排序
        """
        if not self.checkpoint_dir or not os.path.exists(self.checkpoint_dir):
            return []
        
        checkpoints = []
        pattern = f"checkpoint_{step_name}_"
        
        for filename in os.listdir(self.checkpoint_dir):
            if filename.startswith(pattern) and filename.endswith("_correct.json"):
                try:
                    # 提取处理数量 - 格式: checkpoint_{step_name}_{processed_count:04d}_correct.json
                    parts = filename.replace(pattern, "").replace("_correct.json", "")
                    processed_count = int(parts)
                    
                    correct_path = os.path.join(self.checkpoint_dir, filename)
                    incorrect_path = os.path.join(self.checkpoint_dir, 
                                                filename.replace("_correct.json", "_incorrect.json"))
                    
                    # 确保两个文件都存在且可读
                    if os.path.exists(correct_path) and os.path.exists(incorrect_path):
                        # 验证文件完整性
                        try:
                            with open(correct_path, 'r', encoding='utf-8') as f:
                                json.load(f)
                            with open(incorrect_path, 'r', encoding='utf-8') as f:
                                json.load(f)
                            checkpoints.append((processed_count, correct_path, incorrect_path))
                        except json.JSONDecodeError as e:
                            logger.warning(f"[CHECKPOINT] 检查点文件损坏,跳过: {filename} - {e}")
                            continue
                except ValueError:
                    logger.warning(f"[CHECKPOINT] 无法解析检查点文件名: {filename}")
                    continue
        
        # 按处理数量排序,返回最新的在前面
        return sorted(checkpoints, key=lambda x: x[0], reverse=True)
    
    def _load_latest_checkpoint(self, step_name: str) -> Tuple[Optional[int], List[Dict], List[Dict]]:
        """
        加载最新的检查点数据
        返回: (已处理数量, 正确数据列表, 错误数据列表)
        """
        checkpoints = self._find_checkpoint_files(step_name)
        
        if not checkpoints:
            logger.info(f"[CHECKPOINT] 未找到 '{step_name}' 步骤的检查点文件")
            return None, [], []
        
        logger.info(f"[CHECKPOINT] 发现 {len(checkpoints)} 个检查点,尝试加载最新的...")
        
        # 尝试加载最新的检查点
        for processed_count, correct_path, incorrect_path in checkpoints:
            try:
                logger.info(f"[CHECKPOINT] 尝试加载检查点: {processed_count} 个已处理样本")
                
                # 加载正确数据
                with open(correct_path, 'r', encoding='utf-8') as f:
                    correct_data = json.load(f)
                
                # 加载错误数据
                with open(incorrect_path, 'r', encoding='utf-8') as f:
                    incorrect_data = json.load(f)
                
                # 验证数据完整性
                actual_count = len(correct_data) + len(incorrect_data)
                if actual_count != processed_count:
                    logger.warning(f"[CHECKPOINT] 数据不一致: 预期{processed_count},实际{actual_count},尝试下一个检查点...")
                    continue
                
                logger.info(f"[CHECKPOINT] ✅ 成功加载检查点: 正确={len(correct_data)}, 错误={len(incorrect_data)}, 总计={processed_count}")
                return processed_count, correct_data, incorrect_data
                
            except Exception as e:
                logger.warning(f"[CHECKPOINT] ⚠️ 加载检查点失败 ({processed_count}): {e},尝试下一个...")
                continue
        
        logger.warning(f"[CHECKPOINT] ❌ 所有检查点加载失败,将从头开始")
        return None, [], []
    
    def _create_checkpoint(self, step_name: str, processed_count: int, 
                          correct_data: List[Dict], incorrect_data: List[Dict]):
        """创建检查点"""
        if not self.checkpoint_dir:
            return
            
        try:
            checkpoint_name = f"checkpoint_{step_name}_{processed_count:04d}"
            correct_path = os.path.join(self.checkpoint_dir, f"{checkpoint_name}_correct.json")
            incorrect_path = os.path.join(self.checkpoint_dir, f"{checkpoint_name}_incorrect.json")
            
            # 保存正确数据(对于进度检查点,需要保留gt_valid字段以便恢复;对于最终检查点,清理字段)
            if step_name == "final":
                cleaned_correct = clean_validation_data(correct_data) if correct_data else []
            else:
                cleaned_correct = correct_data if correct_data else []
            
            save_json(correct_path, cleaned_correct)
            save_json(incorrect_path, incorrect_data if incorrect_data else [])
            
            logger.info(f"✓ 检查点已创建 [{checkpoint_name}]: 正确={len(correct_data)}, 错误={len(incorrect_data)}")
            
            # 清理旧的检查点(保留最近的3个)
            self._cleanup_old_checkpoints(step_name, keep_count=3)
            
        except Exception as e:
            logger.error(f"[CHECKPOINT] 创建检查点失败: {e}")
            
    def _cleanup_old_checkpoints(self, step_name: str, keep_count: int = 3):
        """清理旧的检查点文件,保留最新的几个"""
        try:
            checkpoints = self._find_checkpoint_files(step_name)
            if len(checkpoints) <= keep_count:
                return
        
==================================
"""
    
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(report_content.strip())
        
    logger.info(f"\n报告已生成: {report_path}")
    print("\n" + report_content.strip())


Launch command:

```bash
source /usr/local/lib/miniconda3/bin/activate SenseRL && \
vllm serve /mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B/ \
    --served-model-name Qwen2.5-VL-72B-Instruct \
    --port 6668 \
    --host 0.0.0.0 \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --disable-custom-all-reduce \
    --dtype half \
    --max-model-len 60000 \
    --trust-remote-code \
    --max-num-batched-tokens 65536 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.95 \
    --swap-space 0 \
    --cpu-offload-gb 0 \
    --block-size 32 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --max-seq-len-to-capture 8192 \
    --limit-mm-per-prompt '{"image": 1}' \
    --disable-log-requests \
    --disable-log-stats \
    --load-format auto
```

Your CLI script and server-side parameters look basically fine, but accuracy far below pipeline/transformers usually comes down to the following root causes:

  1. Inconsistent prompt format/template: your CLI script builds the prompt by manual string concatenation (`_build_vllm_prompt`), whereas pipeline/transformers normally use `apply_chat_template` or the official chat template. Multimodal models such as InternVL/Qwen2.5-VL are extremely sensitive to prompt format; a different chat template changes what the model sees and the output can differ drastically (related issue).

  2. Inconsistent image input: the CLI script sends base64-encoded images; whether the API server decodes them and inserts them at the position the model expects depends on the server's and model's multimodal handling. If that differs from the pipeline's image processing and insertion, the model effectively sees different content and accuracy drops sharply.

  3. Inconsistent tokenizer and special tokens: the pipeline usually calls `AutoTokenizer.apply_chat_template` directly, while the CLI script concatenates strings by hand, which easily drops special tokens (e.g., `<|im_start|>`, `<|im_end|>`) and produces a different token sequence.

  4. Inconsistent inference parameters / stop_token_ids: even though you set `max-model-len`, `max-num-seqs`, and so on, any mismatch in `stop_token_ids`, `temperature`, `top_p`, etc. versus pipeline/transformers will also affect the generations.

How to debug/fix:

  • Strongly prefer generating the prompt with `tokenizer.apply_chat_template` so it matches the pipeline exactly.
  • Check how images are inserted and make sure the API server and the pipeline handle them consistently (e.g., Qwen2.5-VL expects `multi_modal_data` or `image_url`, with a placeholder in the prompt).
  • Compare the actual input token sequences of the CLI and pipeline paths (print the result of `tokenizer.encode`) and make sure they are identical.
  • Check that the inference parameters and `stop_token_ids` match between the API server and the pipeline.
  • See the official multimodal inference examples and the related documentation.

Bottom line: as long as the prompt, image input, tokenizer, and inference parameters are identical, there should be no fundamental accuracy difference between CLI serve and the pipeline. The gap you are seeing is almost certainly caused by the prompt and image-input handling differing between the two paths.
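
A hedged sketch of what the /v1/chat/completions payload could look like: send only the raw instruction text plus the image, and let the server apply the model's own chat template (no manual "USER: ... ASSISTANT:" wrapper and no `<image>` tag in the text). The URL, model name, instruction text, and file path are placeholders:

```python
import base64
import requests

with open("/path/to/sample.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Qwen2.5-VL-72B-Instruct",  # must match --served-model-name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "问题:... 待判断答案:... (your evaluation instructions here)"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
    "max_tokens": 800,
    "temperature": 0.0,
    "top_p": 1.0,
}

resp = requests.post("http://172.21.0.36:6668/v1/chat/completions",
                     json=payload, timeout=180)
print(resp.json()["choices"][0]["message"]["content"])
```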

Let me know if you need more detailed prompt-construction or image-handling code examples.

It's the pipeline that is far below the CLI. Please take a look at my script.

Your question: pipeline-mode inference accuracy is far below CLI serve, and you want help finding the cause in the script.

Judging from the pipeline script and CLI serve script you posted earlier, the root cause is almost certainly an inconsistency in the image input path and the prompt template. vLLM multimodal models (Qwen2.5-VL/InternVL) are extremely sensitive to how the image is inserted and how the prompt is formatted. In pipeline mode you pass a PIL.Image object directly via multi_modal_data, while CLI serve sends a base64-encoded image through the API, which the server decodes and inserts at the correct position. If the prompt template, the image insertion point, or the tokenizer handling differs in any way, the model effectively sees different inputs and accuracy drops sharply. The official issue explicitly notes that the prompt and image handling must match the API server exactly, otherwise multimodal models perform very poorly.

Suggestions:

  1. In pipeline mode, build the prompt with `tokenizer.apply_chat_template`, pass the image via `multi_modal_data={"image": image}`, and keep an image placeholder in the prompt, consistent with the API server (see the sketch after this list).
  2. With CLI serve, send the image as base64; the API server decodes and inserts it automatically, and the prompt format must match the pipeline exactly.
  3. Strongly recommended: compare the `tokenizer.encode` results of both paths and make sure the input token sequences are identical.
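
A hedged sketch of point 1 (not an official InternVL recipe; the paths, placeholder token, and engine arguments are assumptions to adapt to your vLLM version): build the prompt with the tokenizer's chat template and pass the decoded PIL image via `multi_modal_data` instead of splicing a file path into the prompt string.

```python
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/mnt/afs/share_models/git_models/OpenGVLab/InternVL3-78B"
tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=8,
          max_model_len=32768, limit_mm_per_prompt={"image": 1})

# Keep the image placeholder in the templated prompt and pass the decoded
# image separately, instead of splicing a file path into the prompt string.
messages = [{"role": "user", "content": "<image>\n问题:...\n待判断答案:..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("/path/to/sample.jpg").convert("RGB")
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    SamplingParams(temperature=0.0, top_p=1.0, max_tokens=800),
)
print(outputs[0].outputs[0].text)
```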

Let me know if you need a detailed code example or comparison method.

Which gives higher accuracy, Base64 encoding or direct insertion?
And how do I set max_model_len, max_batch_num_token, and max_num_seqs in my pipeline script?

In vLLM multimodal inference, passing the PIL.Image object directly via `multi_modal_data` (the pipeline approach) gives higher accuracy, because the image is processed and inserted where the model expects it. Base64-encoding the image and splicing it into the prompt text instead can make the tokenizer treat the image content as text, which badly hurts accuracy. The official issue recommends that pipeline scripts pass images via `multi_modal_data` and keep a placeholder in the prompt.

Parameter setup (for the pipeline script):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="your/model/path",
    max_model_len=32768,           # context length
    max_num_batched_tokens=8192,   # max tokens per scheduled batch
    max_num_seqs=16,               # max sequences per batch
    gpu_memory_utilization=0.9,    # fraction of GPU memory to use
)
```

These parameters can be passed directly when constructing `LLM`; tune the concrete values to your GPU memory and workload. The official documentation describes them in detail.

Would you like more detailed tuning advice or a code example?
