Hi Folks,
I’ve implemented an autoregressive model that predicts not the next word in a sequence, but rather the next medical event in a patient’s record. It converts the medical record into a sequence of categorical tokens, then uses an HF transformers model architecture (not the pretrained weights, just the base encoder) as the backbone for pre-training and for generation calls.
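To make the setup concrete, the encoding step is conceptually something like the sketch below. The event codes, special tokens, and helper names here are invented for illustration; they are not taken from the MEDS_EIC_AR codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MedicalEvent:
    code: str  # e.g. a lab, diagnosis, or medication code (placeholder format)

def build_vocab(records):
    """Assign each distinct event code a categorical token id."""
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
    for record in records:
        for event in record:
            vocab.setdefault(event.code, len(vocab))
    return vocab

def encode(record, vocab):
    """Convert one patient record into a token-id sequence the LM can consume."""
    return [vocab["<bos>"]] + [vocab[e.code] for e in record] + [vocab["<eos>"]]

records = [
    [MedicalEvent("LAB//HbA1c//high"), MedicalEvent("RX//metformin")],
    [MedicalEvent("RX//metformin")],
]
vocab = build_vocab(records)
print(encode(records[0], vocab))  # → [1, 3, 4, 2]
```

Once every event is a categorical token id, pre-training and generation proceed exactly as with next-word prediction, just over this custom vocabulary.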
You can see my specific model’s generate function here (MEDS_EIC_AR/src/MEDS_EIC_AR/model/model.py at main · mmcdermott/MEDS_EIC_AR · GitHub) and my overall codebase here: GitHub - mmcdermott/MEDS_EIC_AR: A MEDS, "Everything-is-code" style Autoregressive Generative Model, capable of zero-shot inference.
Since GPT-NeoX is supported by vLLM (and that is the base architecture I use), I think it should be possible to use vLLM to accelerate the generation process for my custom model as well – does anyone have any pointers on how to go about doing that? Any advice would be greatly appreciated. Thanks!
Yes, if your model uses the GPT-NeoX architecture (even with custom tokens and a custom generate function), you can use vLLM to accelerate inference, provided your model is compatible with vLLM’s requirements. vLLM supports models that are either natively implemented (like GPT-NeoX) or compatible with the Transformers backend; anything else must be registered, either by adding the model to the vLLM codebase or by using the plugin system for out-of-tree models. See the model registration guide and supported models documentation for details.
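An out-of-tree registration typically boils down to a small plugin function that vLLM invokes at startup. The package, module path, and architecture name below are placeholders for your own code, not something that exists in the MEDS_EIC_AR repo:

```python
# Hypothetical plugin module, e.g. meds_eic_ar_vllm/__init__.py.
# It would be exposed to vLLM via an entry point in pyproject.toml:
#
#   [project.entry-points."vllm.general_plugins"]
#   meds_eic_ar = "meds_eic_ar_vllm:register"

def register():
    # Import inside the function so the plugin stays cheap to load.
    from vllm import ModelRegistry

    # Placeholder architecture and class names -- replace with your own.
    # The "module:Class" string lets vLLM import the class lazily.
    ModelRegistry.register_model(
        "MEDSEICForCausalLM",
        "meds_eic_ar_vllm.model:MEDSEICForCausalLM",
    )
```

With the entry point installed, vLLM discovers and calls `register()` automatically, so the custom architecture name becomes loadable like any built-in one.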
If your model is not directly supported, you should implement your model class in the vLLM format and register it, ensuring it matches the expected interface (e.g., for generative models, implement the VllmModelForTextGeneration interface). You may also need to adapt your tokenizer, as vLLM expects a HuggingFace-compatible tokenizer for text generation tasks. For more, see the writing custom models and model registration docs. Would you like a step-by-step breakdown of how to adapt and register your model for vLLM?
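Since your inputs are categorical token ids rather than text, note that you can sidestep the tokenizer for generation by passing token ids directly. A rough sketch of what invocation could look like once the model is registered; the checkpoint path is a placeholder, and the sampling settings are arbitrary:

```python
# Hypothetical usage sketch, assuming the custom architecture has already
# been registered with vLLM as described above.
def generate_next_events(prompt_ids, checkpoint="path/to/your/checkpoint"):
    from vllm import LLM, SamplingParams
    from vllm.inputs import TokensPrompt

    llm = LLM(model=checkpoint)
    params = SamplingParams(max_tokens=32, temperature=1.0)
    # TokensPrompt lets you feed raw categorical token ids instead of text,
    # which avoids needing a full text tokenizer for generation calls.
    outputs = llm.generate([TokensPrompt(prompt_token_ids=prompt_ids)], params)
    return outputs[0].outputs[0].token_ids
```

Decoding the returned ids back into medical events would then reuse your own vocabulary mapping rather than a HuggingFace tokenizer.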