I’m new to vLLM. I’m currently working on extending a multimodal model within vLLM, and I would like to add support for a custom modality (for example ECG (electrocardiogram) signals).
Is it possible to introduce an entirely new modality that is not part of the existing set (i.e., not image/video/audio), provided that I supply a compatible encoder and projection layer?
In other words, can I register and handle a new input type (e.g., ECG tensors), implement custom parsing/validation, and pass the resulting embeddings into the multimodal pipeline in the same way that vision/audio/video modalities are currently integrated?
If so, are there recommended patterns or extension points in vLLM for defining custom modalities beyond the built-in ones?
Any guidance or examples would be greatly appreciated.
Yes, you can introduce a new custom modality (such as ECG signals) in vLLM by supplying a compatible encoder, projection layer, and the processing logic around them. vLLM’s multimodal support is designed to be extensible: you define a new modality by updating your model’s forward() to accept the extra keyword arguments carrying the modality data, implementing a custom get_multimodal_embeddings() method that runs your encoder and projection, and registering a processor for your modality. The returned embeddings must match the placeholder tokens in the prompt (one embedding vector per placeholder position), and you merge them with the text embeddings just as the built-in modalities do. See the Multi-Modal Support guide for the required steps and extension points.
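To make the "merge with text embeddings" step concrete, here is a minimal, framework-free sketch: embeddings produced for the custom modality are spliced into the text embedding sequence wherever a placeholder token appears. The placeholder id (`ECG_TOKEN_ID`), the helper name, and the toy embeddings are all hypothetical; vLLM's actual merge utility operates on tensors, but the logic is the same.

```python
ECG_TOKEN_ID = 32000  # hypothetical placeholder token id for the ECG modality


def merge_multimodal_embeddings(token_ids, text_embeds, ecg_embeds):
    """Replace each placeholder position with the next ECG embedding.

    token_ids:   flat list of prompt token ids
    text_embeds: one embedding (list of floats) per token
    ecg_embeds:  one embedding per ECG placeholder in the prompt
    """
    n_placeholders = sum(t == ECG_TOKEN_ID for t in token_ids)
    if n_placeholders != len(ecg_embeds):
        raise ValueError("placeholder count must match ECG embedding count")
    ecg_iter = iter(ecg_embeds)
    return [
        next(ecg_iter) if tok == ECG_TOKEN_ID else emb
        for tok, emb in zip(token_ids, text_embeds)
    ]


# Toy usage: one ECG placeholder between two text tokens.
merged = merge_multimodal_embeddings(
    [1, ECG_TOKEN_ID, 2],
    [[0.1], [0.0], [0.2]],   # text embeddings (placeholder slot is dummy)
    [[9.9]],                 # projected ECG embedding
)
```

The mismatch check mirrors what vLLM enforces for the built-in modalities: the number of embeddings your encoder returns must equal the number of placeholder tokens the processor inserted.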
Specifically, you should implement a custom processor for your modality, register it in the multimodal registry, and ensure your model class inherits from SupportsMultiModal. This allows vLLM to parse, validate, and process your new input type in the same way as image/audio/video. There are no restrictions on the modality name or type, as long as you provide the necessary logic for encoding and merging embeddings. For more details, refer to the official documentation.
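A rough sketch of that wiring is below. The names `MULTIMODAL_REGISTRY`, `register_processor`, and `SupportsMultiModal` mirror vLLM's public API, but the exact signatures vary across versions, so stand-in stubs replace the real imports here to keep the sketch self-contained; in real code you would import them from `vllm.multimodal` and `vllm.model_executor.models.interfaces` and subclass the documented processor base classes instead. The `EcgProcessor` and `EcgLanguageModel` classes are hypothetical.

```python
class SupportsMultiModal:
    """Stand-in for vLLM's SupportsMultiModal marker interface."""


class _Registry:
    """Stand-in for vLLM's MULTIMODAL_REGISTRY (pattern only)."""

    def __init__(self):
        self.processors = {}

    def register_processor(self, processor_cls):
        def wrap(model_cls):
            # Associate the processor with the model class.
            self.processors[model_cls.__name__] = processor_cls
            return model_cls
        return wrap


MULTIMODAL_REGISTRY = _Registry()


class EcgProcessor:
    """Parses and validates raw ECG input into model kwargs."""

    def apply(self, ecg_signal):
        # Hypothetical validation: expect a (leads, samples) nested list.
        if not ecg_signal or not isinstance(ecg_signal[0], list):
            raise ValueError("expected ECG as (leads, samples)")
        return {"ecg_input": ecg_signal}


@MULTIMODAL_REGISTRY.register_processor(EcgProcessor)
class EcgLanguageModel(SupportsMultiModal):
    """Model class tagged as multimodal, with the ECG hook point."""

    def get_multimodal_embeddings(self, ecg_input):
        # Your ECG encoder + projection layer would run here, returning
        # one embedding per ECG placeholder token in the prompt.
        ...
```

The key ideas the real API shares with this sketch: the processor is registered against the model class (vLLM uses a decorator for this), and inheriting the `SupportsMultiModal` interface is what tells vLLM to route multimodal kwargs into your model.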
Would you like a step-by-step example or code snippet for registering a custom modality?