Still getting an error with this command (the two speculative-config flags are also written out as a single JSON argument after the log, for readability):
docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
  --restart unless-stopped --name vllm_GLM-4.7-Flash \
  vllm-openai-updated \
  --model zai-org/GLM-4.7-Flash --served-model-name gemma3 \
  --dtype auto --max-model-len 4096 --max-num-seqs 64 \
  --gpu-memory-utilization 0.90 --enable-prefix-caching --enable-chunked-prefill \
  --swap-space 12 --tensor-parallel-size 2 \
  --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --trust-remote-code --enable-auto-tool-choice --max-num-batched-tokens 3000
WARNING: Published ports are discarded when using host network mode
3c7c48374c131ef5543d032a6bfae1891d6be9e085ab6de8b4d1b5106afd3348
ubuntu@ubuntuRTX:~$ docker attach vllm_GLM-4.7-Flash
WARNING 01-21 18:17:39 [argparse_utils.py:195] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in v0.13.
(APIServer pid=1) INFO 01-21 18:17:39 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=1) INFO 01-21 18:17:39 [utils.py:263] non-default args: {'model_tag': 'zai-org/GLM-4.7-Flash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': 'zai-org/GLM-4.7-Flash', 'trust_remote_code': True, 'max_model_len': 4096, 'served_model_name': ['gemma3'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 2, 'swap_space': 12.0, 'enable_prefix_caching': True, 'max_num_batched_tokens': 3000, 'max_num_seqs': 64, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 1}}
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 18:17:47 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=1) INFO 01-21 18:17:47 [model.py:1545] Using max model len 4096
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 18:17:48 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=1) INFO 01-21 18:17:48 [model.py:1545] Using max model len 202752
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 7, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 199, in build_async_engine_client_from_engine_args
(APIServer pid=1) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1626, in create_engine_config
(APIServer pid=1) speculative_config = self.create_speculative_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1337, in create_speculative_config
(APIServer pid=1) return SpeculativeConfig(**self.speculative_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py", line 389, in __post_init__
(APIServer pid=1) raise NotImplementedError(
(APIServer pid=1) NotImplementedError: Speculative decoding with draft model is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.
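For readability, here are the two speculative-decoding flags from the command above rewritten as a single JSON argument. As far as I understand, vLLM also accepts --speculative-config as one JSON string, so this should be equivalent to the dotted form I used (shown only for clarity, not as a confirmed fix for the error):

  # equivalent to: --speculative-config.method mtp --speculative-config.num_speculative_tokens 1
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'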