GLM-4.7-Flash with NVIDIA

Should zai-org/GLM-4.7-Flash (Hugging Face) work with the vllm/vllm-openai:latest image?

I'm getting these errors:
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1) Value error, The checkpoint you are trying to load has model type glm4_moe_lite but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

No, the default vllm/vllm-openai:latest image will not work out of the box with zai-org/GLM-4.7-Flash, because the glm4_moe_lite model type is not recognized by the standard Transformers release. You need the latest vLLM nightly plus Transformers installed from source (main branch) to get support for this new architecture. The error you see is expected with older or default installations. See the official instructions and user reports for details and the required installation commands:

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install -U git+https://github.com/huggingface/transformers.git
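
You can sanity-check the upgrade with a couple of one-liners (a quick sketch; the second command just confirms that the glm4_moe_lite config now resolves, and assumes you can access the repo):

python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('zai-org/GLM-4.7-Flash', trust_remote_code=True).model_type)"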

After updating, you can serve the model as shown in the Hugging Face model card and user issue threads.
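
For example, a minimal non-Docker launch would look something like this as a starting point (a sketch; adjust the parallelism to your GPU count and add the parser/speculative flags from the model card once the basic launch works):

vllm serve zai-org/GLM-4.7-Flash --tensor-parallel-size 2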

Would you like more detail on how to build a custom Docker image or run these commands inside a container?

Yes, please give instructions for creating a custom Docker image.

To create a custom Docker image for vLLM that supports zai-org/GLM-4.7-Flash, start from the official vllm/vllm-openai:latest image and add a layer to install the latest nightly vLLM and Transformers from source. Here is a sample Dockerfile:

FROM vllm/vllm-openai:latest

# Upgrade to the nightly vLLM build
RUN pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly

# Install Transformers from source (main branch) so glm4_moe_lite is recognized
RUN pip install -U git+https://github.com/huggingface/transformers.git

Build your image with:

docker build -t vllm-glm47flash .
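
Before serving anything, you can confirm the upgraded packages actually ended up in the image (a quick sketch; the tag matches the build command above):

docker run --rm --entrypoint python3 vllm-glm47flash -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"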

Then run your container as usual, mounting your model directory and passing the required vLLM arguments.

Would you like a full example of the docker run command or more details on customizing the image?

Yes, please give the docker run command.
When I tried to run it got:
(APIServer pid=1) NotImplementedError: Speculative decoding with draft model is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.

To run your custom image, use a command like the following (replace <your_hf_token> and adjust model path/args as needed):

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your_hf_token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm-glm47flash \
  vllm serve zai-org/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash

The error NotImplementedError: Speculative decoding with draft model is not supported yet means you must use a supported speculative method (e.g., mtp, ngram, medusa, eagle) and not a generic draft model. The example above uses mtp, which is supported for GLM-4.7-Flash.

Would you like more details on speculative decoding methods or further troubleshooting?

Still getting error with this command:
docker run -d --gpus all --ipc=host --network host -p 8000:8000 -v /home/ubuntu/vllm_models:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" -e VLLM_SLEEP_WHEN_IDLE=1 --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 --restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated --model zai-org/GLM-4.7-Flash --served-model-name gemma3 --dtype auto --max-model-len 4096 --max-num-seqs 64 --gpu-memory-utilization 0.90 --enable-prefix-caching --enable-chunked-prefill --swap-space 12 --tensor-parallel-size 2 --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --enable-auto-tool-choice --max-num-batched-tokens 3000
WARNING: Published ports are discarded when using host network mode
3c7c48374c131ef5543d032a6bfae1891d6be9e085ab6de8b4d1b5106afd3348
ubuntu@ubuntuRTX:~$ docker attach vllm_GLM-4.7-Flash
WARNING 01-21 18:17:39 [argparse_utils.py:195] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in v0.13.
(APIServer pid=1) INFO 01-21 18:17:39 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=1) INFO 01-21 18:17:39 [utils.py:263] non-default args: {'model_tag': 'zai-org/GLM-4.7-Flash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': 'zai-org/GLM-4.7-Flash', 'trust_remote_code': True, 'max_model_len': 4096, 'served_model_name': ['gemma3'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 2, 'swap_space': 12.0, 'enable_prefix_caching': True, 'max_num_batched_tokens': 3000, 'max_num_seqs': 64, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 1}}
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 18:17:47 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=1) INFO 01-21 18:17:47 [model.py:1545] Using max model len 4096
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 18:17:48 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=1) INFO 01-21 18:17:48 [model.py:1545] Using max model len 202752
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 7, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 199, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1626, in create_engine_config
(APIServer pid=1)     speculative_config = self.create_speculative_config(
(APIServer pid=1)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1337, in create_speculative_config
(APIServer pid=1)     return SpeculativeConfig(**self.speculative_config)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py", line 389, in __post_init__
(APIServer pid=1)     raise NotImplementedError(
(APIServer pid=1) NotImplementedError: Speculative decoding with draft model is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.

Your NotImplementedError about speculative decoding with a draft model means that vLLM is interpreting your config as a "draft model" setup, which is not supported. For GLM-4.7-Flash you must use the mtp method, and the config must be passed as a single JSON string to --speculative-config, not as separate --speculative-config.method and --speculative-config.num_speculative_tokens flags.

How to fix:
Replace

--speculative-config.method mtp --speculative-config.num_speculative_tokens 1

with

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

in your docker run command. This is the only supported way to pass speculative decoding config in recent vLLM versions (see docs).

Example:

docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
  --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
  --restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
  vllm serve zai-org/GLM-4.7-Flash \
    --served-model-name gemma3 \
    --dtype auto \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --swap-space 12 \
    --tensor-parallel-size 2 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --max-num-batched-tokens 3000
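
Once the container reports the server is up, a quick request against the OpenAI-compatible endpoint is a good smoke test (a sketch; the model name matches the --served-model-name value above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'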

Let me know if you want a full working Dockerfile or further troubleshooting steps.

with your provided command I got:
vllm: error: unrecognized arguments: serve zai-org/GLM-4.7-Flash

The error vllm: error: unrecognized arguments: serve zai-org/GLM-4.7-Flash means you should not include vllm serve in the Docker run command when using the vllm/vllm-openai image (or your custom image based on it). The entrypoint is already set to vllm, so you should start with serve as the first argument.

Correct usage:

docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
  --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
  --restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
  serve zai-org/GLM-4.7-Flash \
    --served-model-name gemma3 \
    --dtype auto \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --swap-space 12 \
    --tensor-parallel-size 2 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --max-num-batched-tokens 3000

Notice serve is the first argument, not vllm serve.
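
If you are ever unsure what an image's entrypoint and default command are (and therefore which arguments you still need to supply), you can inspect the image directly (substitute your own image name):

docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' vllm-openai-updated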

Would you like more detail on Docker entrypoints or troubleshooting?
