How to expose the /v1/audio/transcriptions router for custom models
To expose the /v1/audio/transcriptions endpoint for a custom model in vLLM, your model must implement the SupportsTranscription interface and be registered in the model registry. You need to define required class attributes and methods such as supported_languages, supports_transcription_only, get_speech_to_text_config, and get_generation_prompt. Once these are implemented, vLLM will automatically advertise transcription support and initialize the endpoint—no extra registration is needed beyond this interface implementation. See docs and example code.
If your model does not implement SupportsTranscription, the endpoint will return an error stating the model does not support the Transcriptions API. For multimodal models, follow the multimodal prompt construction pattern as described in the documentation. Would you like a step-by-step code example?
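The interface shape described above can be sketched stand-alone. The attribute and method names below mirror vLLM's SupportsTranscription interface, but the types (SpeechToTextConfig, PromptType, etc.) are stubbed with plain dicts so the snippet runs without vLLM; treat it as a shape reference, not a drop-in implementation.

```python
from typing import Any


class MyASRModel:  # in vLLM: also inherit nn.Module and SupportsTranscription
    supports_transcription = True          # flag vLLM checks at startup
    supports_transcription_only = True     # model only serves transcription
    supported_languages = {"en": "English"}

    @classmethod
    def get_speech_to_text_config(cls, model_config: Any, task_type: str) -> dict:
        # In vLLM this returns a SpeechToTextConfig; a dict stands in here.
        return {"sample_rate": 16000, "max_audio_clip_s": 30}

    @classmethod
    def get_generation_prompt(cls, audio, model_config, stt_config,
                              language, task_type, request_prompt,
                              to_language) -> dict:
        # In vLLM this returns a PromptType carrying the audio as
        # multimodal data alongside the text prompt.
        return {"prompt": "<transcribe>", "multi_modal_data": {"audio": audio}}
```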
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM adding SupportsTranscription and the required functions still doesn't work
If you’ve added SupportsTranscription and the required functions but /v1/audio/transcriptions still doesn’t work, ensure your model is properly registered in the vLLM model registry and that all required class attributes and methods are correctly implemented. Also, verify that your server is running with the correct model and that the model advertises transcription support (i.e., supports_transcription = True). If these are set, vLLM should automatically expose the endpoint—no manual router registration is needed. See docs.
If the endpoint is still missing or returns an error, check server logs for import or registration issues, and confirm your model is listed in the supported models for transcription. Some models may require additional configuration or dependencies (e.g., audio libraries). Would you like troubleshooting steps or example code?
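For the registry point above: out-of-tree models are typically registered via `ModelRegistry.register_model` from a vLLM plugin entry point. The toy registry below only illustrates the lookup the server performs (architecture name to model class path); the real call and the `my_pkg.funasr` module path are shown in the comment and are assumptions about your package layout.

```python
# Real-world pattern (run inside your vLLM plugin, paths are assumptions):
#     from vllm import ModelRegistry
#     ModelRegistry.register_model(
#         "FunASRForConditionalGeneration",
#         "my_pkg.funasr:FunASRForConditionalGeneration",
#     )
# Toy stand-in illustrating the architecture-name -> class mapping:
registry: dict[str, str] = {}


def register_model(arch: str, target: str) -> None:
    registry[arch] = target


register_model("FunASRForConditionalGeneration",
               "my_pkg.funasr:FunASRForConditionalGeneration")
print("FunASRForConditionalGeneration" in registry)  # True
```

If the architecture name reported in your model's `config.json` does not match the registered name, the server cannot resolve the class and the endpoint stays disabled.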
@RunLLM here is my code
@MULTIMODAL_REGISTRY.register_processor(
    FunASRMultiModalProcessor,
    info=FunASRProcessingInfo,
    dummy_inputs=FunASRDummyInputsBuilder,
)
class FunASRForConditionalGeneration(
    nn.Module,
    SupportsTranscription,
    SupportsMultiModal,
):
    packed_modules_mapping = {
        "self_attn.qkv_proj": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj",
        ],
        "encoder_attn.kv_proj": ["encoder_attn.k_proj", "encoder_attn.v_proj"],
    }
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={
            "linear_q.": "q_proj.",
            "linear_k.": "k_proj.",
            "linear_v.": "v_proj.",
            "linear_out.": "out_proj.",
        }
    )
    supports_transcription_only = True
    supports_segment_timestamp = True
    supported_languages = ISO639_1_SUPPORTED_LANGS

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            # TODO language should be optional and can be guessed.
            # For now we default to en. See
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        return super().validate_language(language)

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        model_config: ModelConfig,  # not needed here
        stt_config: SpeechToTextConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        # processor = cached_processor_from_config(model_config)
        if language is None:
            raise ValueError(
                "Language must be specified when creating the funasr prompt"
            )
        # "语音转写" means "speech transcription"
        funasr_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n语音转写:<|AUDIO|><|im_end|>\n<|im_start|>assistant\n"  # noqa: E501
        prompt = {
            "prompt": funasr_prompt,
            "multi_modal_data": {
                "audio": (audio, stt_config.sample_rate),
            },
        }
        return cast(PromptType, prompt)

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: str
    ) -> SpeechToTextConfig:
        processor = cached_processor_from_config(model_config)
        return SpeechToTextConfig(
            max_audio_clip_s=processor.feature_extractor.chunk_length,
            sample_rate=processor.feature_extractor.sampling_rate,
        )

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        processor = cached_processor_from_config(model_config)
        hop_length = processor.feature_extractor.hop_length
        assert hop_length is not None
        return math.ceil(audio_duration_s * stt_config.sample_rate / hop_length)

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        self.config = config
        self.dtype = vllm_config.model_config.dtype
        self.model = FunASRModel(
            vllm_config=vllm_config,
            prefix=maybe_prefix(prefix, "model"),
        )
        logit_scale = getattr(config, "logit_scale", 1.0)
        if config.tie_word_embeddings:
            self.lm_head = self.model.decoder.embed_tokens
        else:
            self.lm_head = ParallelLMHead(
                config.vocab_size,
                config.hidden_size,
                quant_config=quant_config,
                prefix=maybe_prefix(prefix, "lm_head"),
            )
        self.logits_processor = LogitsProcessor(config.vocab_size, scale=logit_scale)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs,
    ) -> torch.Tensor:
        decoder_outputs = self.model(
            input_ids=input_ids,
            positions=positions,
            inputs_embeds=inputs_embeds,
        )
        return decoder_outputs

    def get_language_model(self) -> torch.nn.Module:
        return self.model.decoder

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        audio_input = self._parse_and_validate_audio_input(**kwargs)
        speech = audio_input["input_features"]
        speech_lengths = audio_input["speech_lengths"]
        enc_output = self.model.get_encoder_outputs(
            speech=speech, speech_lengths=speech_lengths
        )
        return enc_output

    def embed_input_ids(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: torch.Tensor | None = None,
        handle_oov_mm_token: bool = False,
    ) -> torch.Tensor:
        inputs_embeds = self.model.decoder.embed_input_ids(input_ids)
        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )

    def _parse_and_validate_audio_input(self, **kwargs: object) -> FunASRAudioInputs:
        input_features = kwargs.pop("input_features", None)
        speech_lengths = kwargs.pop("speech_lengths", None)
        if input_features is not None:
            input_features = json_map_leaves(lambda x: x.to(self.dtype), input_features)
        if speech_lengths is not None:
            speech_lengths = json_map_leaves(lambda x: x.to(self.dtype), speech_lengths)
        return FunASRAudioInputs(
            input_features=input_features, speech_lengths=speech_lengths
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        # add fake zeros bias for k_proj to state_dict
        weights = _create_fake_bias_for_k_proj(weights)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
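For reference, the `get_num_audio_tokens` formula in the code above amounts to `sample_rate / hop_length` tokens per second of audio. With typical Whisper-style feature-extractor values (assumed here: 16 kHz sample rate, hop length 160), that is 100 tokens per second:

```python
import math

# Worked example of the get_num_audio_tokens formula above.
# Values are assumptions for illustration, not read from any real processor.
sample_rate = 16000      # Hz
hop_length = 160         # samples per feature frame
audio_duration_s = 3.0   # seconds of audio

num_tokens = math.ceil(audio_duration_s * sample_rate / hop_length)
print(num_tokens)  # 300
```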
Your implementation of SupportsTranscription and the required methods appears correct, but if /v1/audio/transcriptions is still not exposed, the most common causes are:

- Model Registry: Ensure your model class (FunASRForConditionalGeneration) is actually registered in the vLLM model registry and is being loaded by the server. If the model is not discoverable, the endpoint will not be enabled (docs).
- Class Attributes: Double-check that supports_transcription = True is set as a class attribute (not just supports_transcription_only). This is required for vLLM to detect transcription support (API reference).
- Server Logs: Check the server logs at startup for warnings or errors about model registration, missing attributes, or interface issues. If the model is not listed as supporting transcription, the endpoint will not be exposed.
- Supported Languages: Make sure supported_languages is a valid mapping of ISO 639-1 codes to language names, and that all codes are valid (docs).

If all of the above are correct and the endpoint is still not available, please provide the server startup logs for further diagnosis. Would you like a checklist or example of a minimal working model class?
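The supported-languages point is easy to sanity-check in isolation: every key should be a two-letter ISO 639-1 code. The language table below is a tiny stand-in for ISO639_1_SUPPORTED_LANGS, used only for illustration.

```python
# Stand-in for ISO639_1_SUPPORTED_LANGS (illustrative subset).
supported_languages = {"en": "English", "zh": "Chinese", "de": "German"}


def validate_language_codes(langs: dict[str, str]) -> list[str]:
    """Return keys that don't look like ISO 639-1 codes (2 lowercase letters)."""
    return [code for code in langs
            if not (len(code) == 2 and code.isalpha() and code.islower())]


print(validate_language_codes(supported_languages))  # []
print(validate_language_codes({"eng": "English"}))   # ['eng']
```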
@RunLLM still not working. How can I check if my model supports transcription during startup?
To check if your model supports transcription during startup, look for log messages indicating that the model supports the “transcription” task or that the /v1/audio/transcriptions endpoint is enabled. vLLM logs the supported tasks for each model at startup; if “transcription” is not listed, your model is not recognized as supporting it (docs). Also, you can programmatically check with supports_transcription(model) in a Python shell (API reference).
If you do not see “transcription” in the logs or the endpoint is missing, your model is not correctly registered or does not fully implement the required interface. Would you like instructions on how to add a runtime check or interpret the logs?
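At its core, the programmatic check mentioned above boils down to a runtime attribute check on the model class. The helper and classes below are a self-contained analogue for illustration, not vLLM's exact implementation; in a real environment you would import the actual `supports_transcription` helper from vLLM instead.

```python
class GoodModel:
    supports_transcription = True  # set by the SupportsTranscription mixin in vLLM


class BadModel:
    pass


def check_supports_transcription(model_cls: type) -> bool:
    # vLLM decides at startup whether to enable /v1/audio/transcriptions
    # based on a class-level flag like this one.
    return getattr(model_cls, "supports_transcription", False) is True


print(check_supports_transcription(GoodModel))  # True
print(check_supports_transcription(BadModel))   # False
```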