I have a Python script that takes an instantiated PyTorch model and:

- Adds some layers to it (replaces some layers with custom layers built from standard ops)
- Changes the weights and biases of each layer
I made the assumption that "key" layers like attention will not be modified; only new layers will be added between them.
I would like to integrate such code with vLLM.

What are your recommendations for integration? I know that vLLM does not handle already-instantiated models.
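For context, the kind of model surgery described above might look like this minimal sketch in plain PyTorch (the `CustomBlock` class, the layer-selection rule, and the rescaling factor are all assumptions for illustration, not part of the original script):

```python
import torch
import torch.nn as nn

class CustomBlock(nn.Module):
    """Hypothetical new layer built from standard ops (Linear + GELU)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.proj(x))

def insert_custom_blocks(module, dim):
    """Wrap every nn.Linear in Sequential(original, CustomBlock),
    leaving other layers (e.g. attention) untouched."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, nn.Sequential(child, CustomBlock(dim)))
        else:
            insert_custom_blocks(child, dim)

def rescale_linear_weights(model, factor=0.5):
    """Example in-place weight/bias edit on every nn.Linear."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Linear):
                m.weight.mul_(factor)
                if m.bias is not None:
                    m.bias.mul_(factor)

# Toy stand-in for the instantiated model being modified.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
insert_custom_blocks(model, dim=8)
rescale_linear_weights(model)
out = model(torch.randn(2, 8))
```

The same pattern (recursing over `named_children` and calling `setattr`) applies to a real transformer, with a stricter predicate to skip attention modules.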
vLLM does not support loading an already-instantiated PyTorch model directly; it expects a model directory on disk or a Hugging Face Hub identifier, not a Python object. To integrate your modified model with vLLM:

1. Save the model (with its custom layers and updated weights) in a Hugging Face-compatible format, including `config.json` and the weight files.
2. Load it in vLLM by pointing at the model directory path.
3. If the resulting architecture is not natively supported, adapt the model to vLLM's requirements and register it as a custom model, following the vLLM documentation on adding and registering new models (implementing the required interfaces and weight-loading logic).

See this issue and the basic model integration guide.
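The save-then-load-by-path step can be sketched as follows. A tiny randomly-initialized `LlamaForCausalLM` stands in for your real modified model (the config values are arbitrary placeholders):

```python
import os
import tempfile
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny stand-in model; in practice this is your modified model,
# which must still subclass PreTrainedModel for save_pretrained to work.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    vocab_size=256,
)
model = LlamaForCausalLM(config)

# ... apply your layer insertions / weight edits here ...

# Write config.json + safetensors weights to a directory.
save_dir = tempfile.mkdtemp()
model.save_pretrained(save_dir)

# vLLM can then load it by path (sketch, not run here):
#   from vllm import LLM
#   llm = LLM(model=save_dir)
```

If your edits change the architecture so it no longer matches the declared config class, you would also need a custom config and model class registered with vLLM, per its "adding a new model" docs.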
If your modifications are minor and the model remains compatible with Hugging Face Transformers, you can use vLLM's Transformers backend: ensure your model directory has the correct structure and config, and vLLM can fall back to running the model through Transformers. For more complex changes (e.g., a new attention mechanism), you must implement the model in vLLM's codebase and register it. See the supported models list and the custom model integration guide for details. Would you like a step-by-step breakdown of the integration process?
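A hedged sketch of the Transformers-backend route (the `model_impl` and `trust_remote_code` engine arguments are taken from the vLLM docs; the model path is a placeholder, and the actual `LLM` calls are left commented since they need a GPU and the saved model):

```python
# Engine arguments to force vLLM's Transformers backend instead of a
# native vLLM implementation of the architecture.
engine_args = {
    "model": "/path/to/my-modified-model",  # directory from save_pretrained
    "model_impl": "transformers",           # use the Transformers backend
    "trust_remote_code": True,              # needed if custom code ships with the model
}

# Sketch of the actual load + generate (not executed here):
# from vllm import LLM, SamplingParams
# llm = LLM(**engine_args)
# outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
```

This path avoids writing a native vLLM model class, at the cost of losing some of vLLM's architecture-specific optimizations.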