How to run inference or deploy with my custom model

How do I run inference or deploy with my custom model? What code adaptations do I need to make?

To deploy or run inference with your custom model in vLLM, you must adapt your code to meet vLLM's requirements:

1. All modules must accept a `prefix` argument in their constructor, which vLLM uses for correct attention registration and quantization.
2. Implement a `get_input_embeddings` method.
3. Rewrite the `forward` method to accept flattened `input_ids` and `positions` tensors.
4. If your model uses a non-standard attention mechanism, implement a new attention layer.
5. Optionally, add tensor parallelism and quantization support.
6. Register your model with vLLM if it is not already recognized.

See Implementing a Basic Model and Registering a Model for step-by-step instructions.
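To make points (1)–(3) concrete, here is a minimal, hedged sketch of the shape vLLM expects. All class names (`MyAttention`, `MyDecoderLayer`, `MyModel`) are placeholders, not a real API, and the bodies are stubs; the point is how the `prefix` string is threaded through submodules (e.g. `model.layers.0.self_attn`) and what signatures `get_input_embeddings` and `forward` take:

```python
# Hypothetical skeleton illustrating the structure vLLM expects from a
# custom model. All names here are placeholders for your own classes.

class MyAttention:
    def __init__(self, prefix: str = ""):
        # vLLM uses the full dotted prefix (e.g. "model.layers.0.self_attn")
        # to register this attention instance and to look up per-layer
        # quantization configuration.
        self.prefix = prefix


class MyDecoderLayer:
    def __init__(self, prefix: str = ""):
        # Each submodule extends the parent prefix with its attribute name.
        self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")


class MyModel:
    def __init__(self, prefix: str = "", num_layers: int = 2):
        self.layers = [
            MyDecoderLayer(prefix=f"{prefix}.layers.{i}")
            for i in range(num_layers)
        ]

    def get_input_embeddings(self, input_ids):
        # vLLM calls this to map flattened token ids to embeddings
        # (used, e.g., for multimodal inputs).
        raise NotImplementedError

    def forward(self, input_ids, positions):
        # vLLM passes flattened 1-D input_ids and positions covering all
        # sequences in the batch, rather than a [batch, seq_len] layout.
        raise NotImplementedError


model = MyModel(prefix="model")
print(model.layers[0].self_attn.prefix)  # -> model.layers.0.self_attn
```

In a real model the stubs would wrap `torch.nn` modules and vLLM's attention layers, but the prefix-threading pattern is the part that most commonly needs retrofitting in existing HuggingFace-style code.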

For models compatible with HuggingFace Transformers, ensure your model directory has the correct structure (e.g., `config.json` is present) and that `config.json` contains an `auto_map.AutoModel` entry. If your model is not natively supported, you can register it as an out-of-tree model using `ModelRegistry`. For more details and code examples, refer to Supported Models and Registering a Model.
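The out-of-tree registration mentioned above can be sketched as follows. `ModelRegistry.register_model` is vLLM's documented entry point; the module path `my_model` and the class `MyModelForCausalLM` are hypothetical placeholders for your own code, and the import is guarded so the sketch degrades gracefully where vLLM is not installed:

```python
# Hedged sketch: registering an out-of-tree model with vLLM.
# "my_model" and "MyModelForCausalLM" are placeholder names.
try:
    from vllm import ModelRegistry
    from my_model import MyModelForCausalLM  # hypothetical module

    # The first argument must match an entry in the "architectures"
    # field of the model's config.json so vLLM can resolve it at load time.
    ModelRegistry.register_model("MyModelForCausalLM", MyModelForCausalLM)
except ImportError:
    # vLLM (or the custom model package) is unavailable in this environment.
    pass
```

This registration must run before the engine is constructed; vLLM's docs also describe packaging it as a plugin so it is applied automatically.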

Would you like a detailed step-by-step code example?
