How to register custom out-of-tree models with Transformers Auto Classes for running AsyncLLMEngine in the OPENAI_API_SERVER usage context, without modifying vLLM source code?
Context
I’m currently doing this by calling my registration method from within vllm/model_executor/models/registry.py, which works, but I’d like to avoid modifying vLLM source code so that vLLM version bumps in deployments go faster.
As I understand it, the V1 engine spawns worker processes, and Transformers registration done in the top-level app does not carry over to those workers, which then error out while interpreting the HF config.
To register custom out-of-tree (OOT) models with Transformers Auto Classes for use with AsyncLLMEngine and the OpenAI API server, without modifying vLLM source code, use vLLM’s plugin system. Plugins are loaded in every process vLLM spawns, which is exactly what distributed and multiprocessing setups, including the V1 engine and AsyncLLMEngine, require. Put your registration logic in a plugin entrypoint and install the plugin package so vLLM can discover and run it automatically at startup. This is the officially recommended approach, and it avoids the pitfall you hit: top-level registration does not propagate to worker processes, so custom models fail to load in distributed contexts. See vLLM’s Plugin System documentation and Model Registration guide for details.
Example plugin registration code (in your plugin’s entrypoint, e.g. `register()`):

```python
def register():
    from vllm import ModelRegistry
    from transformers import AutoConfig
    from your_code import YourModelForCausalLM, YourConfigClass

    # Register the config with Transformers so HF config parsing succeeds
    # in every worker process; exist_ok=True makes re-registration a no-op.
    AutoConfig.register("your_model_type", YourConfigClass, exist_ok=True)
    # Register the model implementation with vLLM's model registry.
    ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
```
Then declare an entry point under the `vllm.general_plugins` group in your package’s setup.py (or pyproject.toml) and install the package in your environment. vLLM will load the plugin in every process it starts, including workers spawned by the API server or AsyncLLMEngine, with no changes to vLLM source code.
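A minimal packaging sketch for the entry-point declaration, assuming a hypothetical plugin package named `my_vllm_plugin` whose `__init__.py` exposes the `register()` function above (package and module names are illustrative):

```python
# setup.py for a hypothetical plugin package "my_vllm_plugin"
from setuptools import setup, find_packages

setup(
    name="my-vllm-plugin",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # vLLM discovers and calls this in every process it spawns.
        # Format: "<plugin name> = <module path>:<callable>"
        "vllm.general_plugins": [
            "register_my_models = my_vllm_plugin:register",
        ],
    },
)
```

After `pip install -e .` (or a regular install), starting the OpenAI API server, e.g. `vllm serve <model>`, will run `register()` in the main process and in every worker, so the HF config and model class are known everywhere.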