Hi Folks,
I’ve implemented an autoregressive model that predicts not the next word in a sequence, but rather the next medical event in a patient’s record. It converts the medical record into a sequence of categorical tokens, then uses an HF transformers model architecture (not the pretrained weights, just the base encoder) as the backbone for pre-training and for generation calls.
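To make the setup concrete, the encoding step is conceptually something like the sketch below. The event codes, special tokens, and helper names here are invented for illustration; they are not taken from the MEDS_EIC_AR codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MedicalEvent:
    code: str  # e.g. a lab, diagnosis, or medication code (placeholder format)

def build_vocab(records):
    """Assign each distinct event code a categorical token id."""
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
    for record in records:
        for event in record:
            vocab.setdefault(event.code, len(vocab))
    return vocab

def encode(record, vocab):
    """Convert one patient record into a token-id sequence the LM can consume."""
    return [vocab["<bos>"]] + [vocab[e.code] for e in record] + [vocab["<eos>"]]

records = [
    [MedicalEvent("LAB//HbA1c//high"), MedicalEvent("RX//metformin")],
    [MedicalEvent("RX//metformin")],
]
vocab = build_vocab(records)
print(encode(records[0], vocab))  # → [1, 3, 4, 2]
```

Once every event is a categorical token id, pre-training and generation proceed exactly as with next-word prediction, just over this custom vocabulary.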
You can see my specific model’s generate function here (MEDS_EIC_AR/src/MEDS_EIC_AR/model/model.py at main · mmcdermott/MEDS_EIC_AR · GitHub) and my overall codebase here: GitHub - mmcdermott/MEDS_EIC_AR: A MEDS, "Everything-is-code" style Autoregressive Generative Model, capable of zero-shot inference.
Since GPT-NeoX is supported by vLLM (and that is the base architecture I use), I think it should be possible to use vLLM to accelerate the generation process for my custom model as well – does anyone have any pointers on how to go about doing that? Any advice would be greatly appreciated. Thanks!
Yes, if your model uses the GPT-NeoX architecture (even with custom tokens and a custom generate function), you can use vLLM to accelerate inference, provided your model is compatible with vLLM’s requirements. vLLM supports models that are either natively implemented (like GPT-NeoX) or compatible with the Transformers backend; anything else must be registered, either by adding the model to the vLLM codebase or by using the plugin system for out-of-tree models. See the model registration guide and supported models documentation for details.
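An out-of-tree registration typically boils down to a small plugin function that vLLM invokes at startup. The package, module path, and architecture name below are placeholders for your own code, not something that exists in the MEDS_EIC_AR repo:

```python
# Hypothetical plugin module, e.g. meds_eic_ar_vllm/__init__.py.
# It would be exposed to vLLM via an entry point in pyproject.toml:
#
#   [project.entry-points."vllm.general_plugins"]
#   meds_eic_ar = "meds_eic_ar_vllm:register"

def register():
    # Import inside the function so the plugin stays cheap to load.
    from vllm import ModelRegistry

    # Placeholder architecture and class names -- replace with your own.
    # The "module:Class" string lets vLLM import the class lazily.
    ModelRegistry.register_model(
        "MEDSEICForCausalLM",
        "meds_eic_ar_vllm.model:MEDSEICForCausalLM",
    )
```

With the entry point installed, vLLM discovers and calls `register()` automatically, so the custom architecture name becomes loadable like any built-in one.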
If your model is not directly supported, you should implement your model class in the vLLM format and register it, ensuring it matches the expected interface (e.g., for generative models, implement the VllmModelForTextGeneration interface). You may also need to adapt your tokenizer, as vLLM expects a HuggingFace-compatible tokenizer for text generation tasks. For more, see the writing custom models and model registration docs. Would you like a step-by-step breakdown of how to adapt and register your model for vLLM?
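Since your inputs are categorical token ids rather than text, note that you can sidestep the tokenizer for generation by passing token ids directly. A rough sketch of what invocation could look like once the model is registered; the checkpoint path is a placeholder, and the sampling settings are arbitrary:

```python
# Hypothetical usage sketch, assuming the custom architecture has already
# been registered with vLLM as described above.
def generate_next_events(prompt_ids, checkpoint="path/to/your/checkpoint"):
    from vllm import LLM, SamplingParams
    from vllm.inputs import TokensPrompt

    llm = LLM(model=checkpoint)
    params = SamplingParams(max_tokens=32, temperature=1.0)
    # TokensPrompt lets you feed raw categorical token ids instead of text,
    # which avoids needing a full text tokenizer for generation calls.
    outputs = llm.generate([TokensPrompt(prompt_token_ids=prompt_ids)], params)
    return outputs[0].outputs[0].token_ids
```

Decoding the returned ids back into medical events would then reuse your own vocabulary mapping rather than a HuggingFace tokenizer.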