GLM-4.7-Flash with nvidia

The error vllm: error: unrecognized arguments: serve zai-org/GLM-4.7-Flash means you should not include vllm serve in the Docker run command when using the vllm/vllm-openai image (or your custom image based on it). The entrypoint is already set to vllm, so you should start with serve as the first argument.

Correct usage:

docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
  --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
  --restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
  serve zai-org/GLM-4.7-Flash \
    --served-model-name gemma3 \
    --dtype auto \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --swap-space 12 \
    --tensor-parallel-size 2 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --max-num-batched-tokens 3000

Notice serve is the first argument, not vllm serve.

Would you like more detail on Docker entrypoints or troubleshooting?

Sources:

Hint: Mention @RunLLM in the post for followups.