Trying to run Qwen3.5-397B-A17B-GPTQ-Int4

@RunLLM , I’m trying to run Qwen3.5-397B-A17B-GPTQ-Int4 on Ubuntu 22. Is this supported? My error message:

vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293] █ █ █▄ ▄█
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev173+g8fa68a8ce
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293] █▄█▀ █ █ █ █ model /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-12 23:12:44 [utils.py:229] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4', 'tokenizer': '/mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4', 'trust_remote_code': True, 'max_model_len': 262144, 'quantization': 'moe_wna16', 'served_model_name': ['Qwen3.5-397B-A17B-GPTQ-Int4'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 8, 'enable_prefix_caching': True}
vllm-qwen35-gptq-int4 | (APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 479, in cached_files
vllm-qwen35-gptq-int4 | (APIServer pid=1) hf_hub_download(
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-qwen35-gptq-int4 | (APIServer pid=1) validate_repo_id(arg_value)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise HFValidationError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4'. Use repo_type argument if needed.

Yes, Qwen3.5-397B-A17B-GPTQ-Int4 is supported on Ubuntu 22 with vLLM; the error comes from how the model path is specified. For local models downloaded from Hugging Face, --model must point to the directory that directly contains config.json. For a cached download, that is the revision directory under snapshots/ (e.g. /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/<revision-hash>/), not the parent folder /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4. This is a common issue and is documented in several vLLM discussions and guides.

See the guidance in this issue and this one for details. Make sure your model directory contains config.json and other model files directly, not in a subfolder. Adjusting the path should resolve the HFValidationError.
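As a quick pre-flight check, you can verify that the directory you pass to --model directly contains config.json. This is a minimal sketch (the helper function name is mine; the path is the one from your log):

```shell
# prints "ok: ..." only when the directory passed to --model directly contains config.json
check_model_dir() {
  if [ -f "$1/config.json" ]; then
    echo "ok: $1 looks like a loadable model directory"
  else
    echo "missing config.json in $1 -- point --model one level deeper"
  fi
}

check_model_dir /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4
```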

Would you like more detail or a step-by-step example?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM Due to the large size, I need the model to be stored in /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4/. What are step-by-step examples to update or move ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4 so that the model can load from /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4?

To store and load your model from /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4 instead of ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4, you can either move the snapshot directory or set the HF_HOME environment variable to point to your new location. Here are two step-by-step options:

Option 1: Move the snapshot directory

  1. Find the snapshot directory inside ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/.
  2. Copy or move the contents of this snapshot (the directory containing config.json, model weights, etc.) to /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4.
  3. When launching vLLM, set --model /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4.
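The steps above can be sketched as shell commands. Hedged: this assumes a single snapshot revision under the cache directory (usually true after one download), and the helper function name is mine:

```shell
# copy the (usually single) snapshot out of the HF cache into a flat directory,
# dereferencing the blob symlinks (-L) so the destination holds real files
flatten_snapshot() {  # usage: flatten_snapshot <cache_model_dir> <dest_dir>
  snap=$(ls -d "$1"/snapshots/*/ | head -n 1)
  mkdir -p "$2"
  cp -rL "$snap". "$2"/
}

CACHE=~/.cache/huggingface/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4
[ -d "$CACHE" ] && flatten_snapshot "$CACHE" /mnt/sdb/Qwen3.5-397B-A17B-GPTQ-Int4 || true
```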

Option 2: Use HF_HOME to change the cache location

  1. Set the environment variable: export HF_HOME=/mnt/sdb before running vLLM.
  2. Download the model using the Hugging Face CLI:
    HF_HOME=/mnt/sdb hf download Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
    
  3. vLLM will now use /mnt/sdb as the cache and load the model from there automatically.
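For reference, with HF_HOME set this way, huggingface_hub lays the cache out under $HF_HOME/hub with the repo id flattened into the directory name. A small sketch showing where the files will land:

```shell
export HF_HOME=/mnt/sdb
repo=Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
# hf download stores the repo under $HF_HOME/hub/models--<org>--<name>
cache_path="$HF_HOME/hub/models--$(printf '%s' "$repo" | sed 's|/|--|g')"
echo "$cache_path"
```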

Both methods are supported and described in this issue and the official docs.

Would you like more detail or a sample command for your use case?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM The error continues. Here is the error log, the listing of the path, the hf cache verify, and my docker-compose.yaml:

vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] █ █ █▄ ▄█
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev173+g8fa68a8ce
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] █▄█▀ █ █ █ █ model /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:293]
vllm-qwen35-gptq-int4 | (APIServer pid=1) INFO 03-13 00:10:18 [utils.py:229] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769', 'max_model_len': 262144, 'quantization': 'moe_wna16', 'served_model_name': ['Qwen3.5-397B-A17B-GPTQ-Int4'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 8, 'enable_prefix_caching': True}
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 479, in cached_files
vllm-qwen35-gptq-int4 | (APIServer pid=1) hf_hub_download(
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-qwen35-gptq-int4 | (APIServer pid=1) validate_repo_id(arg_value)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise HFValidationError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. Use repo_type argument if needed.
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) During handling of the above exception, another exception occurred:
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) resolved_config_file = cached_file(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 322, in cached_file
vllm-qwen35-gptq-int4 | (APIServer pid=1) file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 532, in cached_files
vllm-qwen35-gptq-int4 | (APIServer pid=1) _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
vllm-qwen35-gptq-int4 | (APIServer pid=1) resolved_file = try_to_load_from_cache(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-qwen35-gptq-int4 | (APIServer pid=1) validate_repo_id(arg_value)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise HFValidationError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. Use repo_type argument if needed.
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) During handling of the above exception, another exception occurred:
vllm-qwen35-gptq-int4 | (APIServer pid=1)
vllm-qwen35-gptq-int4 | (APIServer pid=1) Traceback (most recent call last):
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
vllm-qwen35-gptq-int4 | (APIServer pid=1) uvloop.run(run_server(args))
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return __asyncio.run(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return runner.run(main)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm-qwen35-gptq-int4 | (APIServer pid=1) return self._loop.run_until_complete(task)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await main
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
vllm-qwen35-gptq-int4 | (APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
vllm-qwen35-gptq-int4 | (APIServer pid=1) async with build_async_engine_client(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await anext(self.gen)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
vllm-qwen35-gptq-int4 | (APIServer pid=1) async with build_async_engine_client_from_engine_args(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-qwen35-gptq-int4 | (APIServer pid=1) return await anext(self.gen)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
vllm-qwen35-gptq-int4 | (APIServer pid=1) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1468, in create_engine_config
vllm-qwen35-gptq-int4 | (APIServer pid=1) maybe_override_with_speculators(
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py", line 520, in maybe_override_with_speculators
vllm-qwen35-gptq-int4 | (APIServer pid=1) config_dict, _ = PretrainedConfig.get_config_dict(
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 662, in get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
vllm-qwen35-gptq-int4 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-qwen35-gptq-int4 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
vllm-qwen35-gptq-int4 | (APIServer pid=1) raise OSError(
vllm-qwen35-gptq-int4 | (APIServer pid=1) OSError: Can't load the configuration of '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769' is the correct path to a directory containing a config.json file
vllm-qwen35-gptq-int4 exited with code 1 (restarting)

ll /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769

total 400
drwxr-xr-x 2 root root 12288 Mar 12 16:48 ./
drwxr-xr-x 3 root root 4096 Mar 12 16:32 ../
lrwxrwxrwx 1 root root 52 Mar 12 16:32 chat_template.jinja -> ../../blobs/a585dec894e63da457d9440ec6aa7caa16d20860
lrwxrwxrwx 1 root root 52 Mar 12 16:32 config.json -> ../../blobs/dfdbf8428a0c67a36660d496952c4491ec571955
lrwxrwxrwx 1 root root 52 Mar 12 16:32 configuration.json -> ../../blobs/3a6d425685de8896b2bc8b59b671e41aea1d7bf3
lrwxrwxrwx 1 root root 52 Mar 12 16:32 generation_config.json -> ../../blobs/85b45ab4f3a24f95a061c5260559471a259187cc
lrwxrwxrwx 1 root root 52 Mar 12 16:32 .gitattributes -> ../../blobs/aa7aacd0134a92c3c1943fdecc75cd8b7420cce6
lrwxrwxrwx 1 root root 52 Mar 12 16:32 LICENSE -> ../../blobs/1d5180a42f1c3383ba7c7bd0a50f0837ef0168df
lrwxrwxrwx 1 root root 52 Mar 12 16:32 merges.txt -> ../../blobs/a494e019ca1502219fd0128658b979e5f05ae8e8
lrwxrwxrwx 1 root root 76 Mar 12 16:34 model.safetensors-00001-of-00094.safetensors -> ../../blobs/1dca8a45d541c1dee9dadd1f88315446b6679c347b091be751733188bb9a056b
[REDACTED]
lrwxrwxrwx 1 root root 76 Mar 12 16:47 model.safetensors.index.json -> ../../blobs/f3a03995063801fc5e18c84ceeca2546849a86375e8d9508554dc3f2ffcdc51d
lrwxrwxrwx 1 root root 52 Mar 12 16:47 preprocessor_config.json -> ../../blobs/2ea84a437d448ff71b08df68fdd949d5cc4ebb64
lrwxrwxrwx 1 root root 52 Mar 12 16:32 README.md -> ../../blobs/a39721f3987c0d9b355685ff41f568146078c15f
lrwxrwxrwx 1 root root 52 Mar 12 16:47 tokenizer_config.json -> ../../blobs/eda48d3e75a8e59a8479ee4ec8b37f76e711d9c1
lrwxrwxrwx 1 root root 76 Mar 12 16:48 tokenizer.json -> ../../blobs/5f9e4d4901a92b997e463c1f46055088b6cca5ca61a6522d1b9f64c4bb81cb42
lrwxrwxrwx 1 root root 52 Mar 12 16:47 video_preprocessor_config.json -> ../../blobs/3ba673a5ad7d4d13f54155ecd38b2a94a6dac8fe
lrwxrwxrwx 1 root root 52 Mar 12 16:47 vocab.json -> ../../blobs/0aa0ce0658d60ac4a5d609f4eadb0e8e43514176

hf cache verify Qwen/Qwen3.5-397B-A17B-GPTQ-Int4

✅ Verified 108 file(s) for 'Qwen/Qwen3.5-397B-A17B-GPTQ-Int4' (model) in /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
All checksums match.

env | grep HF

HF_HOME=/mnt/sdb

cat docker-compose.yaml

services:
  vllm:
    image: orthozany/vllm-qwen35-mtp
    container_name: vllm-qwen35-gptq-int4
    ipc: host
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "8000:8000"
    environment:
      HF_TOKEN: "${HF_TOKEN}"
      HF_HOME: "/mnt/sdb"
      NCCL_DEBUG: "WARN"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
      NCCL_IB_DISABLE: "1"
      NCCL_COMM_BLOCKING: "1"
    volumes:
      - triton_cache:/root/.triton
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769
      --host 0.0.0.0
      --quantization modelopt
      --tensor-parallel-size 8
      --max-model-len 262144
      --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --quantization moe_wna16
    restart: unless-stopped

volumes:
  triton_cache:

Your error occurs because vLLM (and Hugging Face Transformers) does not support loading a model from a directory whose files are symlinks into a "blobs" directory, which is exactly how the Hugging Face cache lays out its snapshots/* directories. vLLM expects a flat directory containing the actual files (config.json, model weights, etc.), not symlinks. This is a common source of confusion when the Hugging Face cache path is used directly as the model path in vLLM Docker deployments.

Solution:
Copy (not symlink) all files from your snapshot directory to a new flat directory, then point vLLM to that directory.

Step-by-step:

  1. Create a new directory for the model:
    mkdir -p /mnt/sdb/qwen35-flat
    
  2. Copy all files from the snapshot directory, resolving symlinks:
    cp -L /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769/* /mnt/sdb/qwen35-flat/
    
    The -L flag ensures symlinks are resolved and actual files are copied.
  3. Update your docker-compose.yaml to use:
    --model /mnt/sdb/qwen35-flat
    
    instead of the snapshot path.
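Before restarting the container, it may be worth confirming the flat directory really contains regular files. A small check (the function name is mine; the path is the one from step 1):

```shell
# succeeds only if the directory has a config.json and no remaining symlinks at its top level
verify_flat_model_dir() {
  [ -f "$1/config.json" ] || return 1
  [ -z "$(find "$1" -maxdepth 1 -type l)" ]
}

verify_flat_model_dir /mnt/sdb/qwen35-flat && echo "ready for --model" || echo "not ready"
```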

This approach is confirmed in vLLM issue discussions and is the recommended way to use local models with vLLM Docker.

Would you like a sample docker-compose.yaml snippet or further troubleshooting steps?


@RunLLM do the files need to exist in both locations or can they be moved from snapshots to /mnt/sdb/qwen35-flat?

The files do not need to exist in both locations. You can move (not just copy) all files from the snapshot directory to /mnt/sdb/qwen35-flat, as long as /mnt/sdb/qwen35-flat contains all required model files (e.g., config.json, model weights, tokenizer files, etc.) directly in that directory and not as symlinks. vLLM only needs access to the flat directory you specify with --model; the original snapshot location is not required for inference if you are not using the Hugging Face cache elsewhere.
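One caveat when moving rather than copying: the snapshot entries are relative symlinks into ../../blobs, so a plain mv would carry the links along but leave their targets behind, and the links would dangle. A hedged sketch (helper name is mine) that dereferences first and then removes the cache copy to reclaim the space:

```shell
# usage: dereference_and_move <snapshot_dir> <dest_dir> <cache_model_dir_to_delete>
dereference_and_move() {
  mkdir -p "$2"
  # -L resolves the blob symlinks into real files; delete the cache copy only if the copy succeeded
  cp -rL "$1"/. "$2"/ && rm -rf "$3"
}

# example call with the paths from this thread (uncomment once you are sure):
# dereference_and_move \
#   /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/b54fd48a0aae8da6594b67889cc370d465362769 \
#   /mnt/sdb/qwen35-flat \
#   /mnt/sdb/hub/models--Qwen--Qwen3.5-397B-A17B-GPTQ-Int4
```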

This is confirmed in Running from Docker with Local Model Files Failed.

Would you like a list of required files for vLLM to recognize the model directory?


I feel a bit spurned by the lack of response, @RunLLM, but I did figure out my issue so I guess the silent treatment was your way of saying “I’ve given you the tools, now use your thinking brain to finish this”.

@RunLLM how can I speed up prompt and generation throughput with this model, Qwen3.5-397B-A17B-GPTQ-Int4? I have 8x A6000 GPUs (384 GB VRAM) on Ubuntu 22, NVIDIA driver version 535.104.05 and CUDA version 12.2. This is my docker-compose.yaml:

services:
  vllm:
    image: orthozany/vllm-qwen35-mtp
    container_name: vllm-qwen35-gpt4
    ipc: host
    ulimits:
      memlock: { soft: -1, hard: -1 }
    ports:
      - "8000:8000"
    environment:
      HF_TOKEN: "${HF_TOKEN}"
      HF_HOME: "/mnt/llm_storage"
      HF_CACHE_DIR: "/mnt/llm_storage"
      TRANSFORMERS_CACHE: "/mnt/llm_storage/cache"  # ensure this subdir is in hf_cache
      TRITON_CACHE_DIR: "/triton_cache"
      NCCL_DEBUG: "WARN"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
      NCCL_IB_DISABLE: "1"
      NCCL_COMM_BLOCKING: "1"
    volumes:
      - hf_cache:/mnt/llm_storage   # mount the *root* of the storage, not a subfolder — HF needs to create the model dir if it doesn't exist
      - triton_cache:/triton_cache:rw
      # if you want to use a temp dir separate from hf cache:
      # - tmpfs:/tmp:rw,noexec,nosuid,size=16G

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Make vLLM aware of local model path
    command: >
      --model /mnt/llm_storage/Qwen3.5-397B-A17B-GPTQ-Int4
      --host 0.0.0.0
      --quantization modelopt
      --tensor-parallel-size 8
      --max-model-len 262144
      --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --quantization moe_wna16

    restart: unless-stopped

volumes:
  hf_cache:   # this will be the root for HF cache too
  open_webui_data:
  triton_cache:

To maximize prompt and generation throughput for Qwen3.5-397B-A17B-GPTQ-Int4 on 8x A6000, increase --max-num-batched-tokens (e.g., 16384 or higher) and set --gpu-memory-utilization high (e.g., 0.95) to maximize KV cache. Also remove the duplicate --quantization flag from your command and keep only one quantization method. Keep --tensor-parallel-size 8 (as you have), and consider --enable-expert-parallel, since this is a MoE model.

For further gains, tune --max-num-seqs (e.g., 16-64 for high concurrency), and ensure your CPU has enough cores (at least 10 for 8 GPUs plus the API/engine processes). See Optimization and Tuning and Performance Tuning with Chunked Prefill.
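A hedged sketch of how those suggestions could look in your compose file's command: block. The numeric values are starting points to benchmark, not measured optima, and the duplicate --quantization modelopt flag is dropped since your log shows moe_wna16 taking effect:

```yaml
    command: >
      --model /mnt/llm_storage/Qwen3.5-397B-A17B-GPTQ-Int4
      --host 0.0.0.0
      --quantization moe_wna16
      --tensor-parallel-size 8
      --max-model-len 262144
      --gpu-memory-utilization 0.95
      --max-num-batched-tokens 16384
      --max-num-seqs 32
      --enable-expert-parallel
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
```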

Would you like a sample command or more detail on tuning these parameters?
