I started testing GLM 4.7-FP8 today using the settings shown below and have noticed that this model doesn’t appear to be outputting an initial . Has anyone else observed something similar? This means that the current GLM45 reasoning parser always returns that the reasoning is null. See curl request output and configuration below:
Curl request
request body:
{"model": "/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Hello, how are you?"
}
],
"stream": false}
response: (Note the lack of any initial <think>)
{
"id": "chatcmpl-a1d48ddb66e28f8e",
"object": "chat.completion",
"created": 1766455881,
"model": "/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1. **Analyze the user's input:** The user said \"Hello, how are you?\".\n2. **Identify the intent:** This is a standard greeting and a polite inquiry about my state.\n3. **Determine the appropriate response:**\n * Acknowledge the greeting.\n * Answer the question (even though I'm an AI and don't have feelings, I should give a standard polite response).\n * Ask how the user is doing or if there's something specific I can help with.\n4. **Drafting the response:**\n * *Option 1:* I am fine. How are you? (Too robotic)\n * *Option 2:* Hello! I'm doing great, thank you for asking. How can I help you today? (Better, standard AI persona)\n * *Option 3:* Hi there! I'm doing well, ready to assist you. What's on your mind? (Friendly and helpful)\n5. **Refining the response:** Option 2 is solid. It covers the greeting, the \"health\" check, and transitions to the purpose of the interaction (helping).\n6. **Final Polish:** \"Hello! I'm doing well, thank you for asking. How can I assist you today?\"</think>Hello! I'm doing well, thank you for asking. How can I assist you today?",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": null,
"reasoning_content": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": 151336,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 19,
"total_tokens": 311,
"completion_tokens": 292,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
DOCKER_IMAGE="vllm/vllm-openai:nightly-aarch64"
HF_CACHE_PATH="/shared-storage/models/llm-weights"
MODEL_PATH="/root/.cache/huggingface/models--zai-org--GLM-4.7-FP8"
TENSOR_PARALLEL_SIZE=4
RAY_PORT=6379
VLLM_API_PORT=8000
TRITON_CACHE="/tmp/vllm-cache/triton"
TORCHINDUCTOR_CACHE="/tmp/vllm-cache/torchinductor"
VLLM_CACHE="/tmp/vllm-cache/vllm"
TORCH_CACHE="/tmp/vllm-cache/torch"
VLLM_SERVE_CMD="vllm serve ${MODEL_PATH} \
--tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
--distributed-executor-backend ray \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--max-model-len 180000 \
--calculate-kv-scales \
--max-num-seqs 20 \
--max-num-batched-tokens 16384 \
--max-parallel-loading-workers 8 \
--load-format safetensors \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port ${VLLM_API_PORT}"
This is running on four GH200 (144GB).