@RunLLM The error continues. Below are the error log, a listing of the snapshot path, the `hf cache verify` output, and my docker-compose.yaml:
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] █ █ █▄ ▄█
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev173+g8fa68a8ce
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] █▄█▀ █ █ █ █ model /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:229] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769', 'max_model_len': 262144, 'quantization': 'moe_wna16', 'served_model_name': ['Qwen3.5-397B-A17B-GPTQ-Int4'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 8, 'enable_prefix_caching': True}
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 479, in cached_files
vllm-qwen35-gptq-int4 | (APIServer pid=1) hf_hub_download(
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-qwen35-gptq-int4 | (APIServer pid=1) validate_repo_id(arg_value)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise HFValidationError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. Use repo_type argument if needed.
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) During handling of the above exception, another exception occurred:
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) resolved_config_file = cached_file(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 322, in cached_file
vllm-qwen35-gptq-int4 | (APIServer pid=1) file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 532, in cached_files
vllm-qwen35-gptq-int4 | (APIServer pid=1) _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
vllm-qwen35-gptq-int4 | (APIServer pid=1) resolved_file = try_to_load_from_cache(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-qwen35-gptq-int4 | (APIServer pid=1) validate_repo_id(arg_value)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise HFValidationError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. Use repo_type argument if needed.
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) During handling of the above exception, another exception occurred:
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
vllm-qwen35-gptq-int4 | (APIServer pid=1) uvloop.run(run_server(args))
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return __asyncio.run(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return runner.run(main)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return self._loop.run_until_complete(task)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await main
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
vllm-qwen35-gptq-int4 | (APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
vllm-qwen35-gptq-int4 | (APIServer pid=1) async with build_async_engine_client(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await anext(self.gen)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
vllm-qwen35-gptq-int4 | (APIServer pid=1) async with build_async_engine_client_from_engine_args(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await anext(self.gen)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
vllm-qwen35-gptq-int4 | (APIServer pid=1) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1468, in create_engine_config
vllm-qwen35-gptq-int4 | (APIServer pid=1) maybe_override_with_speculators(
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py", line 520, in maybe_override_with_speculators
vllm-qwen35-gptq-int4 | (APIServer pid=1) config_dict, _ = PretrainedConfig.get_config_dict(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 662, in get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise OSError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) OSError: Can't load the configuration of '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769' is the correct path to a directory containing a config.json file
vllm-qwen35-gptq-int4 exited with code 1 (restarting)
ll /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
total 400
drwxr-xr-x 2 root root 12288 Mar 12 16:48 ./
drwxr-xr-x 3 root root 4096 Mar 12 16:32 ../
lrwxrwxrwx 1 root root 52 Mar 12 16:32 chat_template.jinja -> ../../blobs/a585dec894e63da457d9440ec6aa7caa16d20860
lrwxrwxrwx 1 root root 52 Mar 12 16:32 config.json -> ../../blobs/dfdbf8428a0c67a36660d496952c4491ec571955
lrwxrwxrwx 1 root root 52 Mar 12 16:32 configuration.json -> ../../blobs/3a6d425685de8896b2bc8b59b671e41aea1d7bf3
lrwxrwxrwx 1 root root 52 Mar 12 16:32 generation_config.json -> ../../blobs/85b45ab4f3a24f95a061c5260559471a259187cc
lrwxrwxrwx 1 root root 52 Mar 12 16:32 .gitattributes -> ../../blobs/aa7aacd0134a92c3c1943fdecc75cd8b7420cce6
lrwxrwxrwx 1 root root 52 Mar 12 16:32 LICENSE -> ../../blobs/1d5180a42f1c3383ba7c7bd0a50f0837ef0168df
lrwxrwxrwx 1 root root 52 Mar 12 16:32 merges.txt -> ../../blobs/a494e019ca1502219fd0128658b979e5f05ae8e8
lrwxrwxrwx 1 root root 76 Mar 12 16:34 model.safetensors-00001-of-00094.safetensors -> ../../blobs/1dca8a45d541c1dee9dadd1f88315446b6679c347b091be751733188bb9a056b
[REDACTED]
lrwxrwxrwx 1 root root 76 Mar 12 16:47 model.safetensors.index.json -> ../../blobs/f3a03995063801fc5e18c84ceeca2546849a86375e8d9508554dc3f2ffcdc51d
lrwxrwxrwx 1 root root 52 Mar 12 16:47 preprocessor_config.json -> ../../blobs/2ea84a437d448ff71b08df68fdd949d5cc4ebb64
lrwxrwxrwx 1 root root 52 Mar 12 16:32 README.md -> ../../blobs/a39721f3987c0d9b355685ff41f568146078c15f
lrwxrwxrwx 1 root root 52 Mar 12 16:47 tokenizer_config.json -> ../../blobs/eda48d3e75a8e59a8479ee4ec8b37f76e711d9c1
lrwxrwxrwx 1 root root 76 Mar 12 16:48 tokenizer.json -> ../../blobs/5f9e4d4901a92b997e463c1f46055088b6cca5ca61a6522d1b9f64c4bb81cb42
lrwxrwxrwx 1 root root 52 Mar 12 16:47 video_preprocessor_config.json -> ../../blobs/3ba673a5ad7d4d13f54155ecd38b2a94a6dac8fe
lrwxrwxrwx 1 root root 52 Mar 12 16:47 vocab.json -> ../../blobs/0aa0ce0658d60ac4a5d609f4eadb0e8e43514176
hf cache verify Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
Verified 108 file(s) for 'Qwen/Qwen3.5-397B-A17B-GPTQ-Int4' (model) in /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
All checksums match.
env | grep HF
HF_HOME=/mnt/sdb
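To rule out the symlink layout itself, note that an HF hub snapshot directory is just relative symlinks into `../../blobs`, which resolve from the link's own directory, so reads through `config.json` work as long as the blob is present. A minimal reproduction of that layout (all paths and the JSON content here are made up for illustration):

```shell
# Mimic the HF hub cache layout: snapshots/<rev>/config.json -> ../../blobs/<hash>
tmp=$(mktemp -d)
mkdir -p "$tmp/blobs" "$tmp/snapshots/rev1"
echo '{"model_type": "demo"}' > "$tmp/blobs/abc123"
ln -s ../../blobs/abc123 "$tmp/snapshots/rev1/config.json"
# The relative link resolves against the snapshot directory, so this succeeds:
cat "$tmp/snapshots/rev1/config.json"
rm -rf "$tmp"
```

This matches what `ll` shows above, so the cache on the host looks structurally fine.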
cat docker-compose.yaml
services:
  vllm:
    image: orthozany/vllm-qwen35-mtp
    container_name: vllm-qwen35-gptq-int4
    ipc: host
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "8000:8000"
    environment:
      HF_TOKEN: "${HF_TOKEN}"
      HF_HOME: "/mnt/sdb"
      NCCL_DEBUG: "WARN"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
      NCCL_IB_DISABLE: "1"
      NCCL_COMM_BLOCKING: "1"
    volumes:
      - triton_cache:/root/.triton
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
      --host 0.0.0.0
      --quantization modelopt
      --tensor-parallel-size 8
      --max-model-len 262144
      --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --quantization moe_wna16
    restart: unless-stopped
volumes:
  triton_cache:
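For reference, the `command:` block is passed as arguments to the image's entrypoint. Assuming the entrypoint is the stock vLLM OpenAI server (an assumption; this is a custom image), the run above is equivalent to the sketch below. Note that `--quantization` appears twice in the compose file; argparse keeps the last occurrence, which is why the log reports `'quantization': 'moe_wna16'`.

```shell
# Equivalent direct invocation inside the container (sketch, assuming the
# image's entrypoint is the stock vLLM OpenAI API server). The duplicate
# --quantization modelopt flag from the compose file is dropped here,
# since the later moe_wna16 value wins anyway.
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769 \
  --host 0.0.0.0 \
  --quantization moe_wna16 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
```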