usage field is always None in streaming responses, so token usage cannot be obtained

OliverZ11x · June 10, 2025, 12:42pm

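Problem description:
When Qwen3-14B is deployed in a vLLM container and streaming is enabled, generation finishes normally (finish_reason=stop), but the usage field is None in every streamed chunk, so token usage cannot be obtained. The environment and reproduction steps are below: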

Environment:

vLLM version: vllm/vllm-openai:latest (pulled in June 2025)
Model: Qwen3-14B (downloaded via ModelScope, mounted at /model/model_xizang)
Deployment command:

bash

docker run -d --runtime nvidia --gpus all \
-v /home/ubuntu/.cache/modelscope/hub/models/Qwen/Qwen3-14B:/model/model_xizang \
--env "HUGGING_FACE_HUB_TOKEN=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" \
-p 8000:8000 \
--ipc=host vllm/vllm-openai:latest \
--model /model/model_xizang \
--tensor-parallel-size 2 \
--api-key token-abc123 \
--dtype 'float16' \
--enable-auto-tool-choice \
--tool-call-parser hermes

Client code (Python + OpenAI SDK):

python

from openai import OpenAI

client = OpenAI(
    base_url="http://172.32.1.161:8000/v1",
    api_key="token-abc123",
)

response = client.chat.completions.create(
    model="/model/model_xizang",
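    messages=[
        # System prompt: "You are an AI assistant. Please answer the user's questions in Chinese."
        {"role": "system", "content": "你是一个AI助手，请用中文回答用户的问题。"},
        # User message: "Hello!"
        {"role": "user", "content": "你好！"}
    ],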
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

last_chunk = None
for chunk in response:
    print(chunk)
    # Guard the index: a chunk may carry an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
    last_chunk = chunk

if last_chunk and hasattr(last_chunk, 'usage') and last_chunk.usage is not None:
    # Prints "Total token usage: <n>"
    print(f"\n\n总token使用量: {last_chunk.usage.total_tokens}")
else:
    # Prints "Unable to get token usage information"
    print("\n\n无法获取token使用量信息")

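Steps to reproduce:

Start the vLLM container with the command above and confirm that the model is mounted correctly.
Run the client code to send a streaming request to /v1/chat/completions.
Observe the output: content is generated normally, but usage is None on every ChatCompletionChunk, including the last chunk, whose finish_reason is stop.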
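To tell a server-side omission apart from an SDK parsing issue, the observation in the last step can also be checked on the raw event stream. The following is a minimal sketch, assuming the endpoint emits standard server-sent events with `data: {...}` payload lines and a final `data: [DONE]` sentinel. The helper name `usage_from_sse_lines` and the sample lines are hypothetical and invented for illustration; they are not output captured from the server in this post.

```python
import json
from typing import Iterable, Optional


def usage_from_sse_lines(lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null `usage` object found in an SSE stream, or None."""
    found = None
    for raw in lines:
        line = raw.strip()
        # Payload lines start with "data:"; skip blank lines and anything else.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        event = json.loads(payload)
        if event.get("usage") is not None:
            found = event["usage"]
    return found


# Hypothetical sample lines (made up; no chunk carries usage, as in the report).
sample = [
    'data: {"choices": [{"delta": {"content": "Hi"}, "finish_reason": null}], "usage": null}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
print(usage_from_sse_lines(sample))  # prints None
```

The input can be any iterable of text lines, for example the saved output of a streaming HTTP request. If this helper also returns None on the real stream, the usage data is absent at the protocol level rather than being dropped by the SDK.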

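Expected behavior:
According to the vLLM documentation, the last chunk of a streaming response should contain the complete usage information (e.g. prompt_tokens, completion_tokens, total_tokens).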

Actual behavior:
usage is None on every chunk, so token usage cannot be counted. The terminal output is:

plaintext

ChatCompletionChunk(..., usage=None)
...
无法获取token使用量信息

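Additional notes:

In non-streaming mode (stream=False), the usage field is returned correctly.
I tried updating vLLM and switching to another model (e.g. LLaMA-2); the problem persists.
The container logs show no obvious errors, and generation itself works normally.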

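Help requested:

What is the correct way to get usage in a streaming response?
Is a configuration setting (such as a request parameter) missing, or is this a version compatibility issue?
How can token consumption be tracked in real time while streaming?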
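On the real-time question, one possible approach is sketched below. It assumes that the deployed vLLM version accepts a `continuous_usage_stats` flag inside `stream_options` and, when it is set, attaches cumulative usage to each chunk; both points should be verified against the server version in use. `UsageSnapshot`, `track_usage`, and the fake chunks are hypothetical names and made-up data for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable, Iterator

# Assumed vLLM-specific request option; verify against the deployed version.
# Would be passed as: client.chat.completions.create(..., stream=True, **STREAM_KWARGS)
STREAM_KWARGS = {
    "stream_options": {"include_usage": True, "continuous_usage_stats": True}
}


@dataclass
class UsageSnapshot:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def track_usage(chunks: Iterable[Any]) -> Iterator[UsageSnapshot]:
    """Yield the latest known usage after every chunk.

    Usage values are treated as cumulative, so a newer value replaces
    the older one instead of being added to it.
    """
    latest = UsageSnapshot()
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            latest = UsageSnapshot(usage.prompt_tokens, usage.completion_tokens)
        yield latest


# Stand-in chunks with made-up counts; the middle one carries no usage.
fake_stream = [
    SimpleNamespace(usage=SimpleNamespace(prompt_tokens=5, completion_tokens=1)),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=SimpleNamespace(prompt_tokens=5, completion_tokens=3)),
]
totals = [snap.total_tokens for snap in track_usage(fake_stream)]
print(totals)  # prints: [6, 6, 8]
```

If the server does not support per-chunk usage, the generator simply keeps yielding zeros until a chunk with usage arrives, so the same loop still reports the final totals at the end of the stream.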
