We want to set up our vLLM instance in an airgapped environment (e.g. pulling the necessary model(s) once and then disconnecting/blocking internet access). The models are downloaded to the local cache volume of the instance, but once internet access is cut off the vLLM instance does not start up properly:
To run vLLM in an airgapped environment, you must (1) download the model and all required files (e.g., config.json, tokenizer files) to a local directory, and (2) start vLLM by pointing directly to this local directory, not by model name. Additionally, set the environment variable HF_HUB_OFFLINE=1 to prevent vLLM from attempting to connect to the internet. If you use a local cache, you may need to provide the full snapshot path to the model inside the cache, not just the model name. For gated/private models, ensure all files are present, as vLLM does not always download every required file automatically. Example:
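A minimal sketch of both steps, using placeholder paths and facebook/opt-125m as a stand-in model. On a machine with internet access, pull the full repository into a plain directory (here via huggingface_hub's snapshot_download) and copy it to the airgapped host:

```python
from huggingface_hub import snapshot_download

# Run this once on a connected machine; the target directory can then be
# copied to the airgapped instance (path and model name are examples).
snapshot_download(
    repo_id="facebook/opt-125m",
    local_dir="/models/opt-125m",
)
```

On the airgapped host, set HF_HUB_OFFLINE=1 and point vLLM at that directory rather than at the model name:

```python
import os

# Force offline mode before vLLM / huggingface_hub are imported so that
# no network calls are attempted at startup.
os.environ["HF_HUB_OFFLINE"] = "1"

from vllm import LLM, SamplingParams

# Point vLLM directly at the local directory containing config.json,
# the tokenizer files and the model weights (path is an example).
llm = LLM(model="/models/opt-125m")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```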
If you still see errors, double-check that the directory contains all necessary files (e.g., config.json, tokenizer files, model weights) and that you are passing the local path to the --model argument rather than a HuggingFace model name (e.g., “facebook/opt-125m”). This approach is confirmed in multiple issues and is the recommended workaround for airgapped/offline deployments according to vLLM GitHub issues, the troubleshooting docs, and the offline inference docs.
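If it helps, here is a small, hypothetical sanity check for the local directory (file names follow the ones mentioned above; sharded checkpoints may use several model-*.safetensors files instead of a single one):

```python
from pathlib import Path

# Hypothetical pre-flight check before starting vLLM offline (path is an example).
model_dir = Path("/models/opt-125m")

has_config = (model_dir / "config.json").exists()
has_tokenizer = any(model_dir.glob("tokenizer*"))
has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))

print(f"config.json: {has_config}, tokenizer files: {has_tokenizer}, weights: {has_weights}")
if all([has_config, has_tokenizer, has_weights]):
    print(f"{model_dir} looks complete; pass it as the local model path.")
```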
Would you like a step-by-step guide for preparing the local model directory or for troubleshooting specific errors?
Thanks @RunLLM!
This was exactly what I was looking for.
Just one point to add: if the models have already been downloaded to the local cache and the environment variable HF_HUB_OFFLINE=1 is set, you don’t need to specify the full path to the model. It’s enough to use the repoName/model format (e.g. Qwen/Qwen3-235B-A22B-FP8).
Correct—if the model and all required files are present in the local HuggingFace cache, and HF_HUB_OFFLINE=1 is set, you can use the repoName/model format (e.g., Qwen/Qwen3-235B-A22B-FP8) as the model argument. vLLM will resolve the model from the local cache without attempting any network calls, as confirmed in vLLM GitHub Issue #4859 and offline inference documentation.
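A minimal sketch of that pattern, assuming the repo already sits in the default HuggingFace cache on the airgapped host; scan_cache_dir is used here only to list what is cached, and the Qwen model plus the tensor_parallel_size value are examples (a 235B FP8 checkpoint needs correspondingly large hardware):

```python
import os

# Offline mode: huggingface_hub and vLLM resolve everything from the local cache.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import scan_cache_dir
from vllm import LLM

# Optional: inspect the cache to confirm the repo (and its size) is present.
for repo in scan_cache_dir().repos:
    print(repo.repo_id, repo.size_on_disk, "bytes")

# With the snapshot cached and HF_HUB_OFFLINE=1 set, the repoName/model
# form is resolved locally without any network calls.
llm = LLM(model="Qwen/Qwen3-235B-A22B-FP8", tensor_parallel_size=8)
```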
Would you like more detail on how vLLM resolves models from the cache or how to verify your cache contents?