How to expose v1/audio/transcriptions router for custom models

To expose the /v1/audio/transcriptions endpoint for a custom model in vLLM, your model must implement the SupportsTranscription interface and be registered in the model registry. You need to define required class attributes and methods such as supported_languages, supports_transcription_only, get_speech_to_text_config, and get_generation_prompt. Once these are implemented, vLLM will automatically advertise transcription support and initialize the endpoint—no extra registration is needed beyond this interface implementation. See docs and example code.

If your model does not implement SupportsTranscription, the endpoint will return an error stating the model does not support the Transcriptions API. For multimodal models, follow the multimodal prompt construction pattern as described in the documentation. Would you like a step-by-step code example?

Hint: Mention @RunLLM in the post for followups.

@RunLLM adding SupportsTranscription and the required functions still doesn't work

If you’ve added SupportsTranscription and the required functions but /v1/audio/transcriptions still doesn’t work, ensure your model is properly registered in the vLLM model registry and that all required class attributes and methods are correctly implemented. Also, verify that your server is running with the correct model and that the model advertises transcription support (i.e., supports_transcription = True). If these are set, vLLM should automatically expose the endpoint—no manual router registration is needed. See docs.

If the endpoint is still missing or returns an error, check server logs for import or registration issues, and confirm your model is listed in the supported models for transcription. Some models may require additional configuration or dependencies (e.g., audio libraries). Would you like troubleshooting steps or example code?


@RunLLM here is my code
```python
@MULTIMODAL_REGISTRY.register_processor(
    FunASRMultiModalProcessor,
    info=FunASRProcessingInfo,
    dummy_inputs=FunASRDummyInputsBuilder,
)
class FunASRForConditionalGeneration(
    nn.Module,
    SupportsTranscription,
    SupportsMultiModal,
):
    packed_modules_mapping = {
        "self_attn.qkv_proj": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj",
        ],
        "encoder_attn.kv_proj": ["encoder_attn.k_proj", "encoder_attn.v_proj"],
    }

    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={
            "linear_q.": "q_proj.",
            "linear_k.": "k_proj.",
            "linear_v.": "v_proj.",
            "linear_out.": "out_proj.",
        }
    )

    supports_transcription_only = True
    supports_segment_timestamp = True
    supported_languages = ISO639_1_SUPPORTED_LANGS

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            # TODO: language should be optional and can be guessed.
            # For now we default to en. See
            # transformers/src/transformers/models/whisper/generation_whisper.py
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        return super().validate_language(language)

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        model_config: ModelConfig,  # not needed here
        stt_config: SpeechToTextConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        # processor = cached_processor_from_config(model_config)
        if language is None:
            raise ValueError(
                "Language must be specified when creating the funasr prompt"
            )

        funasr_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n语音转写:<|AUDIO|><|im_end|>\n<|im_start|>assistant\n"  # noqa: E501
        prompt = {
            "prompt": funasr_prompt,
            "multi_modal_data": {
                "audio": (audio, stt_config.sample_rate),
            },
        }
        return cast(PromptType, prompt)

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: str
    ) -> SpeechToTextConfig:
        processor = cached_processor_from_config(model_config)
        return SpeechToTextConfig(
            max_audio_clip_s=processor.feature_extractor.chunk_length,
            sample_rate=processor.feature_extractor.sampling_rate,
        )

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        processor = cached_processor_from_config(model_config)
        hop_length = processor.feature_extractor.hop_length
        assert hop_length is not None
        return math.ceil(audio_duration_s * stt_config.sample_rate / hop_length)

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        self.config = config
        self.dtype = vllm_config.model_config.dtype
        self.model = FunASRModel(
            vllm_config=vllm_config,
            prefix=maybe_prefix(prefix, "model"),
        )
        logit_scale = getattr(config, "logit_scale", 1.0)
        if config.tie_word_embeddings:
            self.lm_head = self.model.decoder.embed_tokens
        else:
            self.lm_head = ParallelLMHead(
                config.vocab_size,
                config.hidden_size,
                quant_config=quant_config,
                prefix=maybe_prefix(prefix, "lm_head"),
            )
        self.logits_processor = LogitsProcessor(config.vocab_size, scale=logit_scale)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs,
    ) -> torch.Tensor:
        decoder_outputs = self.model(
            input_ids=input_ids,
            positions=positions,
            inputs_embeds=inputs_embeds,
        )
        return decoder_outputs

    def get_language_model(self) -> torch.nn.Module:
        return self.model.decoder

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        audio_input = self._parse_and_validate_audio_input(**kwargs)

        speech = audio_input["input_features"]
        speech_lengths = audio_input["speech_lengths"]
        enc_output = self.model.get_encoder_outputs(
            speech=speech, speech_lengths=speech_lengths
        )
        return enc_output

    def embed_input_ids(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: torch.Tensor | None = None,
        handle_oov_mm_token: bool = False,
    ) -> torch.Tensor:
        inputs_embeds = self.model.decoder.embed_input_ids(input_ids)
        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )

    def _parse_and_validate_audio_input(self, **kwargs: object) -> FunASRAudioInputs:
        input_features = kwargs.pop("input_features", None)
        speech_lengths = kwargs.pop("speech_lengths", None)
        if input_features is not None:
            input_features = json_map_leaves(lambda x: x.to(self.dtype), input_features)
        if speech_lengths is not None:
            speech_lengths = json_map_leaves(lambda x: x.to(self.dtype), speech_lengths)
        return FunASRAudioInputs(
            input_features=input_features, speech_lengths=speech_lengths
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(
            self,
        )
        # add fake zeros bias for k_proj to state_dict
        weights = _create_fake_bias_for_k_proj(weights)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
```

Your implementation of SupportsTranscription and the required methods appears correct, but if /v1/audio/transcriptions is still not exposed, the most common causes are:

  1. Model Registry: Ensure your model class (FunASRForConditionalGeneration) is actually registered in the vLLM model registry and is being loaded by the server. If the model is not discoverable, the endpoint will not be enabled (docs).

  2. Class Attributes: Double-check that supports_transcription = True is set as a class attribute (not just supports_transcription_only). This is required for vLLM to detect transcription support (API reference).

  3. Server Logs: Check the server logs at startup for warnings or errors about model registration, missing attributes, or interface issues. If the model is not listed as supporting transcription, the endpoint will not be exposed.

  4. Supported Languages: Make sure supported_languages is a valid mapping of ISO 639-1 codes to language names, and that all codes are valid (docs).

If all of the above are correct and the endpoint is still not available, please provide the server startup logs for further diagnosis. Would you like a checklist or example of a minimal working model class?
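The checklist above can be self-audited before launching the server with a small attribute scan of the model class. The member names come from the points above; the helper itself is a hand-rolled sketch, not a vLLM API:

```python
# Hand-rolled pre-flight check: report which transcription-related members a
# model class is missing. The member names follow the vLLM docs; this helper
# itself is hypothetical, not part of vLLM.
REQUIRED_MEMBERS = (
    "supports_transcription",     # class attribute, must be True
    "supported_languages",        # ISO 639-1 code -> language name mapping
    "get_speech_to_text_config",  # classmethod
    "get_generation_prompt",      # classmethod
)


def missing_transcription_members(model_cls: type) -> list[str]:
    missing = [name for name in REQUIRED_MEMBERS if not hasattr(model_cls, name)]
    # Present-but-wrong value is as bad as absent.
    if getattr(model_cls, "supports_transcription", None) not in (None, True):
        missing.append("supports_transcription (present but not True)")
    return missing


class Incomplete:
    """Demo class that implements nothing."""


class Complete:
    """Demo class with the full member set."""

    supports_transcription = True
    supported_languages = {"en": "English"}

    @classmethod
    def get_speech_to_text_config(cls, model_config, task_type):
        ...

    @classmethod
    def get_generation_prompt(cls, *args, **kwargs):
        ...
```

Running it against your class, e.g. `missing_transcription_members(FunASRForConditionalGeneration)`, should return an empty list.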


@RunLLM still not working. How can I check whether my model supports transcription during startup?


To check if your model supports transcription during startup, look for log messages indicating that the model supports the “transcription” task or that the /v1/audio/transcriptions endpoint is enabled. vLLM logs the supported tasks for each model at startup; if “transcription” is not listed, your model is not recognized as supporting it (docs). Also, you can programmatically check with supports_transcription(model) in a Python shell (API reference).

If you do not see “transcription” in the logs or the endpoint is missing, your model is not correctly registered or does not fully implement the required interface. Would you like instructions on how to add a runtime check or interpret the logs?
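The `supports_transcription(model)` check mentioned above boils down to an attribute/protocol test. A self-contained approximation of that mechanism (the protocol and helper below are simplified stand-ins, not vLLM's actual code) looks like this:

```python
# Simplified approximation of a supports_transcription() style check:
# a runtime-checkable protocol plus a value check on the flag.
# This is a stand-in, not vLLM's actual definition.
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsTranscriptionLike(Protocol):
    supports_transcription: bool


def model_supports_transcription(model: object) -> bool:
    # isinstance() against a runtime-checkable protocol only verifies that
    # the attribute exists, so also check its value explicitly.
    return (isinstance(model, SupportsTranscriptionLike)
            and model.supports_transcription is True)


class WithSupport:
    supports_transcription = True


class WithoutSupport:
    pass
```

If the equivalent check against your model instance returns False while the class looks complete, that usually points at the wrong class being loaded (registry/import issue) rather than the interface itself.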
