vLLM offline pipeline v0.9 default parameters

Jasper · August 2, 2025, 7:46am

What are the default parameters for the vLLM offline pipeline in v0.9.0?
I am deploying the InternVL3-78B model.

Also, what improvements would upgrading to v1.0 bring?

RunLLM · August 2, 2025, 7:46am

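In vLLM offline inference, the defaults are determined mainly by the constructor arguments of the LLM class, such as model, max_model_len, and max_num_seqs. Any argument you leave unset falls back to the model's own configuration or to vLLM's built-in default. For example, max_model_len defaults to the maximum length the model supports, and max_num_seqs falls back to a built-in default (listed later in this thread as 256 × world_size for offline inference); see the official documentation and example code for details. For InternVL3-78B, set max_model_len and max_num_seqs according to your available GPU memory.

Moving from v0.9.0 to v1.0: vLLM V1 is a core re-architecture that introduces a unified scheduler and a KV cache manager, among other changes. It improves performance in long-context scenarios and simplifies configuration: many optimizations are enabled by default, which reduces the need for manual tuning. The stated benefits include higher performance, near-zero CPU overhead, and a more maintainable architecture; see the vLLM V1 User Guide.

Would you like a more detailed parameter reference or an analysis of the upgrade's impact?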

Jasper · August 2, 2025, 7:48am

Will upgrading to v1.0 cause version conflicts with other parts of the environment? Have there been any such cases?

RunLLM · August 2, 2025, 7:48am

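So far, neither the official documentation nor the issue tracker explicitly reports serious version conflicts between vLLM v1.0 and other dependencies (such as PyTorch or transformers) after an upgrade. The vLLM release notes list the compatible dependency versions, so install the recommended environment exactly to avoid conflicts.

When dependency conflicts do occur, the usual cause is a mismatch between the PyTorch, CUDA, and transformers versions; see the known issues and environment recommendations. There are currently no public reports of widespread dependency conflicts caused by upgrading to v1.0.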
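Before and after an upgrade, it can help to record which versions are actually installed so they can be compared with the release notes. The following is a minimal sketch using only the Python standard library; the package list is an assumption based on the packages named above, and the helper name installed_versions is illustrative.

```python
from importlib import metadata

# Packages named in the reply above; extend the tuple as needed.
PACKAGES = ("vllm", "torch", "transformers")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in the old environment and once in the new one gives two snapshots to diff when something breaks.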

Would you like a specific dependency compatibility list or steps for troubleshooting an actual conflict?

Jasper · August 2, 2025, 7:49am

With the same parameters, model, and GPUs, will v1.0 be faster than v0.9.0?

RunLLM · August 2, 2025, 7:50am

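vLLM v1.0 is usually faster than v0.9.0 in long-context and high-concurrency scenarios, thanks to architectural optimizations such as the unified scheduler and the KV cache manager, and because many performance features are enabled by default. However, for some models or specific configurations, users have reported v1.0 performing slightly worse than v0.9.0. The outcome depends on the model type, the inference parameters, and the hardware; see the performance comparison discussions and the official benchmarks.

Benchmark in your own workload. In the vast majority of cases v1.0 outperforms v0.9.0, especially for long texts and large-batch inference. If you see a regression, the issue discussions can guide parameter tuning.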
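A minimal sketch of such a benchmark follows, assuming a hypothetical callable generate_fn that takes a list of prompts and returns one list of generated token ids per prompt. It is not a vLLM API; it only measures wall-clock throughput so the same harness can be run against both versions.

```python
import time


def measure_throughput(generate_fn, prompts, repeats=3):
    """Call generate_fn(prompts) several times; return the best tokens/second.

    generate_fn is a hypothetical callable returning one list of generated
    token ids per prompt.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate_fn(prompts)
        elapsed = time.perf_counter() - start
        tokens = sum(len(ids) for ids in outputs)
        if elapsed > 0:
            best = max(best, tokens / elapsed)
    return best
```

With vLLM, generate_fn could wrap llm.generate and collect outputs[0].token_ids from each returned request output. Keep the prompts, sampling parameters, and hardware identical across the two runs, and discard a warm-up run if startup cost matters.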

Would you like a detailed benchmarking method or tuning suggestions?

Jasper · August 2, 2025, 7:51am

--disable-custom-all-reduce
--dtype half
--max-model-len 32768
--trust-remote-code
--max-num-batched-tokens 65536
--max-num-seqs 128
--gpu-memory-utilization 0.95
--swap-space 0
--cpu-offload-gb 0
--block-size 32
--enable-chunked-prefill
--enable-prefix-caching
--max-seq-len-to-capture 8192
--limit-mm-per-prompt '{"image": 1}'
--disable-log-requests
--disable-log-stats
--load-format auto

What are the default settings for each of these parameters?

RunLLM · August 2, 2025, 7:52am

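The defaults for these parameters (when not set manually) are:

--disable-custom-all-reduce: False (custom all-reduce is enabled by default; some platforms disable it automatically)
--dtype: auto (FP16 or BF16, depending on the dtype of the model weights)
--max-model-len: read automatically from the model config; when unspecified, the maximum length the model supports
--trust-remote-code: False
--max-num-batched-tokens: when unspecified, offline inference defaults to 4096 × world_size and online serving to 2048 × world_size
--max-num-seqs: offline inference defaults to 256 × world_size, online serving to 128 × world_size
--gpu-memory-utilization: 0.9
--swap-space: 4 (GiB)
--cpu-offload-gb: 0
--block-size: platform dependent; CUDA supports up to 32, HPU defaults to 128
--enable-chunked-prefill: enabled by default in V1
--enable-prefix-caching: enabled by default in V1
--max-seq-len-to-capture: 8192
--limit-mm-per-prompt: 1 per modality by default
--disable-log-requests: False
--disable-log-stats: False
--load-format: auto

For detailed descriptions, see the official documentation and the parameter table.

Would you like more detailed parameter explanations or a working configuration example?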

Jasper · August 2, 2025, 8:55am

--max-num-batched-tokens: when unspecified, offline inference defaults to 4096 × world_size and online serving to 2048 × world_size
--max-num-seqs: offline inference defaults to 256 × world_size, online serving to 128 × world_size

What does "4096 × world_size" mean here?

--enable-chunked-prefill:
--enable-prefix-caching:
Are these two parameters enabled by default in v0.9.0?

RunLLM · August 2, 2025, 8:56am

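world_size is the number of parallel worker processes, which usually equals the number of GPUs in use. For example, with 4-GPU parallelism, world_size=4.
--enable-chunked-prefill and --enable-prefix-caching are both off by default in v0.9.0 and must be set explicitly; only V1 enables these two features by default. See the parameter defaults reference and the chunked prefill documentation.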
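The scaling described above can be written out as plain arithmetic. This is a sketch only: the base values are the ones quoted in this thread and have not been checked against the vLLM source, and the helper name scaled_defaults is illustrative. It takes world_size as tensor-parallel size times pipeline-parallel size.

```python
# Base values as quoted in this thread; treat them as unverified and
# check them against the vLLM version you actually run.
BASE_DEFAULTS = {
    "offline": {"max_num_batched_tokens": 4096, "max_num_seqs": 256},
    "online": {"max_num_batched_tokens": 2048, "max_num_seqs": 128},
}


def scaled_defaults(mode, tensor_parallel_size, pipeline_parallel_size=1):
    """Multiply each quoted base default by world_size (TP size x PP size)."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    return {name: base * world_size
            for name, base in BASE_DEFAULTS[mode].items()}


if __name__ == "__main__":
    print(scaled_defaults("offline", tensor_parallel_size=8))
    print(scaled_defaults("online", tensor_parallel_size=8))
```

Under these assumptions, 8-way tensor parallelism would give 32768 batched tokens and 2048 sequences for offline inference.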

Would you like a detailed explanation of what world_size does, or of how chunked prefill and prefix caching work?

Jasper · August 2, 2025, 9:11am

So, in other words:
assuming tp=8
--max-num-batched-tokens: when unspecified, offline inference defaults to 4096 × 8 and online serving to 2048 × 8

--max-num-seqs: offline inference defaults to 256 × 8, online serving to 128 × 8

Jasper · August 2, 2025, 5:48pm

With vLLM v0.9.0 and the offline pipeline,
what is the default for max-tokens?

RunLLM · August 2, 2025, 5:48pm

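My understanding is that in vLLM v0.9.0, offline pipeline (offline inference) mode has no global default for max_tokens (the maximum number of generated tokens); it is set per request through SamplingParams. If you do not set it, the value comes from the default behavior of SamplingParams or the model's tokenizer, not from a global vLLM-level parameter. The related documentation and the sampling parameter reference do not show a global max_tokens default.

If you meant max_num_batched_tokens (the maximum number of tokens per batch), the offline inference default is 4096 × world_size. See the official documentation.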
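Because max_tokens is a per-request setting, it is safer to set it explicitly than to rely on a default (to my knowledge SamplingParams itself defaults max_tokens to 16, but verify that against your installed version). A minimal sketch of choosing a value that still fits inside the context window follows; the helper name generation_budget is illustrative and not part of vLLM.

```python
def generation_budget(max_model_len, prompt_tokens, requested_max_tokens):
    """Return a max_tokens value that keeps prompt + output within max_model_len."""
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested_max_tokens, remaining)


if __name__ == "__main__":
    # A 10000-token prompt under max_model_len=11000 leaves room for 1000 tokens.
    print(generation_budget(11000, 10000, 2048))
```

The result can then be passed as SamplingParams(max_tokens=...) for that request.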

Would you like a more detailed parameter reference or a code example?

Sources:

Official documentation

Jasper · August 2, 2025, 6:18pm

These are the parameters I use to launch serve:

    --served-model-name InternVL3-78B \
    --port 6871 \
    --host 0.0.0.0 \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --disable-custom-all-reduce \
    --dtype half \
    --max-model-len 11000 \
    --trust-remote-code \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.9 \
    --swap-space 0 \
    --cpu-offload-gb 0 \
    --block-size 32 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --max-seq-len-to-capture 8192 \
    --limit-mm-per-prompt '{"image": 1}' \
    --disable-log-requests \
    --disable-log-stats \
    --load-format auto

If I want to convert this serve setup to pipeline inference,
which parameters do I still need to set, and are any of them unnecessary?

RunLLM · August 2, 2025, 6:18pm

To convert the serve launch parameters from online serving (API server) to offline pipeline inference, keep only the parameters related to model loading, parallelism, and GPU memory, and drop the ones specific to the API server: --port, --host, --served-model-name, --disable-log-requests, and --disable-log-stats.

Core parameters worth keeping: --model, --tensor-parallel-size, --distributed-executor-backend, --dtype, --max-model-len, --trust-remote-code, --max-num-seqs, --gpu-memory-utilization, --swap-space, --cpu-offload-gb, --block-size, --enable-chunked-prefill, --enable-prefix-caching, --max-seq-len-to-capture, --limit-mm-per-prompt, --load-format. See the official parameter reference for details.
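The conversion described above can be sketched as a small filter that drops the server-only flags and turns the rest into keyword arguments for the LLM class. This assumes each remaining flag maps to an engine argument of the same name in snake_case, which should be checked against your vLLM version. The names serve_flags_to_llm_kwargs, build_llm, and model_path are illustrative, and the model path is left as an argument because the original command does not show it.

```python
# Flags that only matter to the API server, per the reply above.
SERVER_ONLY = {"port", "host", "served_model_name",
               "disable_log_requests", "disable_log_stats"}

# The serve flags from the post above; bare switches become True.
SERVE_FLAGS = {
    "--served-model-name": "InternVL3-78B",
    "--port": 6871,
    "--host": "0.0.0.0",
    "--tensor-parallel-size": 8,
    "--distributed-executor-backend": "mp",
    "--disable-custom-all-reduce": True,
    "--dtype": "half",
    "--max-model-len": 11000,
    "--trust-remote-code": True,
    "--max-num-seqs": 16,
    "--gpu-memory-utilization": 0.9,
    "--swap-space": 0,
    "--cpu-offload-gb": 0,
    "--block-size": 32,
    "--enable-chunked-prefill": True,
    "--enable-prefix-caching": True,
    "--max-seq-len-to-capture": 8192,
    "--limit-mm-per-prompt": {"image": 1},
    "--disable-log-requests": True,
    "--disable-log-stats": True,
    "--load-format": "auto",
}


def serve_flags_to_llm_kwargs(flags):
    """Convert {'--flag-name': value} into snake_case kwargs, minus server-only flags."""
    kwargs = {}
    for flag, value in flags.items():
        name = flag.lstrip("-").replace("-", "_")
        if name not in SERVER_ONLY:
            kwargs[name] = value
    return kwargs


def build_llm(model_path):
    """Construct an offline LLM; model_path is whatever was passed to serve."""
    from vllm import LLM  # imported lazily so the helper above works without vLLM
    return LLM(model=model_path, **serve_flags_to_llm_kwargs(SERVE_FLAGS))
```

Calling build_llm with the same model path used for serve would then give an engine configured like the server, and generation requests would pass a SamplingParams object to llm.generate.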

Would you like a concrete command template for offline pipeline inference, or explanations of the parameters?

Sources:

Official parameter reference
