Tool call issue with GLM-4.5-Air

We highly appreciate your work; what you guys have built is fantastic!

We are using the latest version of vLLM, installed via pip, to deploy GLM-4.5-Air and provide inference services for the Claude Code frontend. Normal conversations work fine, but problems occur as soon as tool calls are involved. We see errors like the following:

"
tool call info TodoWrite
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] Error trying to handle streaming tool call.
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] Traceback (most recent call last):
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] File "/home/xxx/anaconda3/envs/py311ktransqwen3/lib/python3.11/site-packages/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py", line 272, in extract_tool_calls_streaming
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] tool_name = tool_id.split('.')[1].split(':')[0]
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] ~~~~~~~~~~~~~~~~~~^^^
ERROR 07-31 08:23:37 [glm4_moe_tool_parser.py:402] IndexError: list index out of range
"

The line "tool call info TodoWrite" comes from a print statement I added after glm4_moe_tool_parser.py:402 to inspect the tool_id value.

This is my startup command:

"
export CONTEXT_LENGTH=131072; vllm serve ./GLM-4.5-Air-GPTQ-Int4-Int8Mix --served-model-name GLM-4.5-Air --enable-expert-parallel --max-num-seqs 32 --max-model-len $CONTEXT_LENGTH --trust-remote-code --tensor-parallel-size 8 --host 0.0.0.0 --port 8000 --enable-auto-tool-choice --tool-call-parser glm4_moe --reasoning-parser glm4_moe --gpu-memory-utilization 0.9
"

Thank you for your feedback! The IndexError: list index out of range in glm4_moe_tool_parser.py occurs when the code splits tool_id on '.' and accesses the second element, but the split result is too short. This means the tool call output from the model does not match the expected format (e.g., tool_id is missing a '.' or is otherwise malformed). According to glm4_moe_tool_parser.py, the parser expects tool call IDs in a specific format, and if the model outputs something unexpected (such as just TodoWrite), this error will occur.
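
To make the failure mode concrete, here is a minimal sketch. The "namespace.name:index" ID shape is an assumption inferred from the split logic in the traceback; the actual scheme the parser expects may differ:

"
# ID shape assumed from the split in the traceback: "<namespace>.<name>:<index>"
tool_id = "functions.TodoWrite:0"
tool_name = tool_id.split('.')[1].split(':')[0]  # -> "TodoWrite"

# If the model emits only a bare name, split('.') returns a single-element
# list, so indexing [1] raises the error seen in the log:
tool_id = "TodoWrite"
tool_name = tool_id.split('.')[1].split(':')[0]  # IndexError: list index out of range
"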

This issue is likely due to the model generating tool call outputs that do not conform to the parser's expected structure. You may need to check your chat template and ensure the model is prompted to output tool calls in the correct format, or add error handling in the parser to skip or log malformed tool calls. If you need to support more flexible tool call formats, consider modifying the parser to handle cases where tool_id does not contain a '.'. Would you like a step-by-step guide on how to patch the parser or adjust your template?
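
As a minimal sketch of such a defensive change (an illustration only, not the actual upstream fix; parse_tool_name is a hypothetical helper), the parser could fall back to treating the whole ID as the tool name when the prefix is absent:

"
def parse_tool_name(tool_id: str) -> str:
    # Strip an optional "<namespace>." prefix, then an optional ":<index>" suffix.
    if '.' in tool_id:
        tool_id = tool_id.split('.', 1)[1]
    return tool_id.split(':', 1)[0]

assert parse_tool_name("functions.TodoWrite:0") == "TodoWrite"
assert parse_tool_name("TodoWrite") == "TodoWrite"  # no IndexError on bare names
"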

Thank you for your reply, and I ended up here:

It seems this issue has already been fixed there, and the solution is ready for release. Thank you very much; I sincerely appreciate your work :slight_smile:

Yes, you are correct: the GLM-4 tool streaming output issue has been addressed in the pull request "Fix GLM tool parser". This PR is merged and will be included in the next release.

If you need the fix immediately, you can build vLLM from the latest main branch. Would you like instructions on how to do that?
