The error "vllm: error: unrecognized arguments: serve zai-org/GLM-4.7-Flash" means the Docker run command repeats part of the command that the image's entrypoint already supplies. When using the vllm/vllm-openai image (or a custom image based on it), do not include vllm serve in the arguments. The entrypoint is already set to vllm, so the arguments after the image name should start with serve.
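The exact entrypoint depends on how the image was built, so it may be worth confirming it before editing the command. The following is a minimal sketch, assuming the custom image tag vllm-openai-updated from the command in this answer. leading_args is a hypothetical helper written for illustration, not part of Docker or vLLM. If the reported entrypoint already ends in serve, the arguments would start with the model name rather than with serve.

```shell
# Hypothetical helper: given the entrypoint JSON reported by docker inspect,
# print what the arguments after the image name should start with.
leading_args() {
  entrypoint="$1"
  model="$2"
  case "$entrypoint" in
    *'"serve"'*) echo "$model" ;;        # entrypoint already includes serve
    *'"vllm"'*)  echo "serve $model" ;;  # entrypoint is vllm alone
    *)           echo "unknown entrypoint; inspect it manually" ;;
  esac
}

# Only query Docker when it is installed; the image tag is taken from the
# command in this answer and may differ on your machine.
if command -v docker >/dev/null 2>&1; then
  ep=$(docker image inspect --format '{{json .Config.Entrypoint}}' vllm-openai-updated 2>/dev/null) || ep=""
  if [ -n "$ep" ]; then
    echo "entrypoint: $ep"
    echo "start the arguments with: $(leading_args "$ep" zai-org/GLM-4.7-Flash)"
  fi
fi
```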
Correct usage:
docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
-v /home/ubuntu/vllm_models:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
--shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
--restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
serve zai-org/GLM-4.7-Flash \
--served-model-name gemma3 \
--dtype auto \
--max-model-len 4096 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-chunked-prefill \
--swap-space 12 \
--tensor-parallel-size 2 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--trust-remote-code \
--enable-auto-tool-choice \
--max-num-batched-tokens 3000
Note that serve is the first argument after the image name, not vllm serve.
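Once the container is running, a quick check against the OpenAI-compatible /v1/models endpoint can confirm that the model is being served. This is a sketch under a few assumptions: the server listens on vLLM's default port 8000 (with --network host it binds directly on the host, and Docker discards the -p mapping in that mode), the served name is gemma3 as set by --served-model-name, and lists_model is a hypothetical helper that does a simple text match rather than full JSON parsing.

```shell
# Hypothetical helper: succeed if a /v1/models JSON body lists the given model id.
lists_model() {
  printf '%s' "$1" | grep -q "\"id\": *\"$2\""
}

# Only probe the server when curl is installed; an empty body means no answer yet.
if command -v curl >/dev/null 2>&1; then
  body=$(curl -s --max-time 5 http://localhost:8000/v1/models) || body=""
  if lists_model "$body" gemma3; then
    echo "gemma3 is being served"
  else
    echo "not ready yet; check: docker logs vllm_GLM-4.7-Flash"
  fi
fi
```

Model loading can take several minutes, so a "not ready yet" result shortly after startup does not necessarily indicate a failure.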
Would you like more detail on Docker entrypoints or troubleshooting?