GLM 4.7-FP8 Reasoning Start Issues?

kman2949 · December 23, 2025, 2:23am

I started testing GLM 4.7-FP8 today using the settings shown below and have noticed that this model doesn’t appear to be outputting an initial <think> tag. Has anyone else observed something similar? As a result, the current glm45 reasoning parser always returns null for the reasoning. See the curl request, its output, and my configuration below:

Curl request

request body:
{"model": "/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8",
"messages": [
    {
    "role": "system",
    "content": "You are a helpful AI assistant."
    },
    {
        "role": "user",
        "content": "Hello, how are you?"
    }
],
"stream": false}

response: (Note the lack of any initial <think>)
{
    "id": "chatcmpl-a1d48ddb66e28f8e",
    "object": "chat.completion",
    "created": 1766455881,
    "model": "/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "1.  **Analyze the user's input:** The user said \"Hello, how are you?\".\n2.  **Identify the intent:** This is a standard greeting and a polite inquiry about my state.\n3.  **Determine the appropriate response:**\n    *   Acknowledge the greeting.\n    *   Answer the question (even though I'm an AI and don't have feelings, I should give a standard polite response).\n    *   Ask how the user is doing or if there's something specific I can help with.\n4.  **Drafting the response:**\n    *   *Option 1:* I am fine. How are you? (Too robotic)\n    *   *Option 2:* Hello! I'm doing great, thank you for asking. How can I help you today? (Better, standard AI persona)\n    *   *Option 3:* Hi there! I'm doing well, ready to assist you. What's on your mind? (Friendly and helpful)\n5.  **Refining the response:** Option 2 is solid. It covers the greeting, the \"health\" check, and transitions to the purpose of the interaction (helping).\n6.  **Final Polish:** \"Hello! I'm doing well, thank you for asking. How can I assist you today?\"</think>Hello! I'm doing well, thank you for asking. How can I assist you today?",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning": null,
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 151336,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 311,
        "completion_tokens": 292,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}

Configuration:

DOCKER_IMAGE="vllm/vllm-openai:nightly-aarch64"
HF_CACHE_PATH="/shared-storage/models/llm-weights"
MODEL_PATH="/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8"
TENSOR_PARALLEL_SIZE=4
RAY_PORT=6379
VLLM_API_PORT=8000

TRITON_CACHE="/tmp/vllm-cache/triton"
TORCHINDUCTOR_CACHE="/tmp/vllm-cache/torchinductor"
VLLM_CACHE="/tmp/vllm-cache/vllm"
TORCH_CACHE="/tmp/vllm-cache/torch"

VLLM_SERVE_CMD="vllm serve ${MODEL_PATH} \
  --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
  --distributed-executor-backend ray \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-model-len 180000 \
  --calculate-kv-scales \
  --max-num-seqs 20 \
  --max-num-batched-tokens 16384 \
  --max-parallel-loading-workers 8 \
  --load-format safetensors \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port ${VLLM_API_PORT}"

This is running on four GH200 (144GB).

RunLLM · December 23, 2025, 2:24am

Yes, this is a known issue: GLM-4.7 (and some other GLM-4.x models) sometimes omit the initial <think> tag in their outputs, which causes the current glm45 reasoning parser in vLLM to return reasoning: null even though the model is reasoning internally. The parser expects reasoning content to be enclosed within <think>...</think>, so if the model outputs only the closing tag, or omits both tags, the reasoning field will be empty. This behavior has been observed and discussed for other GLM and Qwen models as well, and is often related to the chat template or model output format, not a vLLM bug per se. See the discussions in "GLM-4.5 reasoning parser streaming fails without tools in request" and "Missing Opening for Qwen32B".
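To make the failure mode concrete, here is a toy illustration of a parser that requires both tags. It is only a sketch of the expectation described above, not vLLM's actual glm45 parser code; the names `THINK_BLOCK` and `strict_extract` are made up for this example.

```python
import re

# Toy stand-in for a parser that requires BOTH tags.
# This is NOT vLLM's glm45 implementation, just an illustration.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def strict_extract(text):
    """Return (reasoning, content); reasoning is None unless both tags appear."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No complete <think>...</think> pair: everything stays in content.
        return None, text
    reasoning = match.group(1).strip()
    content = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, content


with_tag = "<think>User greeted me.</think>Hello!"
without_tag = "User greeted me.</think>Hello!"

print(strict_extract(with_tag))
print(strict_extract(without_tag))
```

With the opening tag missing, the reasoning text and the stray closing tag stay in the content, which matches the response shown in the original post.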

To work around this, you can try modifying the chat template so that the model outputs the <think> tag at the start of its reasoning, or prompt the model explicitly to use the <think>...</think> format. For more details on reasoning model support and parser expectations, see the Reasoning Outputs documentation.
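If changing the template or prompt is not an option, another possible stopgap is to split the text on the client side. The sketch below assumes the response looks like the one in the original post: reasoning text, then a lone </think>, then the final answer. The helper name `split_reasoning` is hypothetical and not part of vLLM or any client library.

```python
def split_reasoning(content, open_tag="<think>", close_tag="</think>"):
    """Split a completion into (reasoning, answer), tolerating a missing opening tag.

    Hypothetical client-side helper; not part of vLLM.
    """
    head, sep, tail = content.partition(close_tag)
    if not sep:
        # No closing tag at all: treat the whole text as the answer.
        return None, content
    reasoning = head.strip()
    if reasoning.startswith(open_tag):
        # Drop the opening tag when the model did emit it.
        reasoning = reasoning[len(open_tag):].strip()
    return (reasoning or None), tail.strip()


sample = "1. Analyze the input.\n2. Draft a reply.</think>Hello! How can I assist you today?"
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

You could apply this to `choices[0].message.content` whenever `reasoning_content` comes back null. It only splits on the first closing tag, so an answer that itself contains a literal </think> would need extra handling.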

Would you like more detail on how to modify the chat template or prompt to encourage <think> output?
