how to expose v1/audio/transcriptions router for custom models
To expose the /v1/audio/transcriptions endpoint for a custom model in vLLM, your model must implement the SupportsTranscription interface and be registered in the model registry. You need to define required class attributes and methods such as supported_languages, supports_transcription_only, get_speech_to_text_config, and get_generation_prompt. Once these are implemented, vLLM will automatically advertise transcription support and initialize the endpoint—no extra registration is needed beyond this interface implementation. See docs and example code.
If your model does not implement SupportsTranscription, the endpoint will return an error stating the model does not support the Transcriptions API. For multimodal models, follow the multimodal prompt construction pattern as described in the documentation. Would you like a step-by-step code example?
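As a rough sketch of the shape involved — purely illustrative, with the vllm types (the real `SupportsTranscription` base class, `SpeechToTextConfig`, `PromptType`) replaced by plain stand-ins so the snippet is self-contained:

```python
# Illustrative skeleton only. In real code, SupportsTranscription comes from
# vllm.model_executor.models.interfaces and the config/prompt types from vllm;
# stand-ins are used here so the sketch runs without vLLM installed.
from typing import Any


class SupportsTranscription:  # stand-in for vLLM's interface
    supports_transcription: bool = True


class MyASRModel(SupportsTranscription):
    # Class attributes vLLM inspects when deciding whether to enable
    # the /v1/audio/transcriptions endpoint for this model.
    supports_transcription = True
    supports_transcription_only = True  # model does not do plain text generation
    supported_languages = {"en": "English", "zh": "Chinese"}

    @classmethod
    def get_speech_to_text_config(cls, model_config: Any, task_type: str) -> dict:
        # Real implementations return a SpeechToTextConfig; a dict stands in here.
        return {"sample_rate": 16000, "max_audio_clip_s": 30}

    @classmethod
    def get_generation_prompt(
        cls, audio: Any, stt_config: dict, language: Any, **kwargs: Any
    ) -> dict:
        # Real implementations return a PromptType with the audio attached
        # as multi_modal_data.
        return {
            "prompt": "<transcribe>",
            "multi_modal_data": {"audio": (audio, stt_config["sample_rate"])},
        }
```

The attribute and method names follow the interface described above; the bodies are placeholders, not a working model.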
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM adding SupportsTranscription and required functions still don’t work
If you’ve added SupportsTranscription and the required functions but /v1/audio/transcriptions still doesn’t work, ensure your model is properly registered in the vLLM model registry and that all required class attributes and methods are correctly implemented. Also, verify that your server is running with the correct model and that the model advertises transcription support (i.e., supports_transcription = True). If these are set, vLLM should automatically expose the endpoint—no manual router registration is needed. See docs.
If the endpoint is still missing or returns an error, check server logs for import or registration issues, and confirm your model is listed in the supported models for transcription. Some models may require additional configuration or dependencies (e.g., audio libraries). Would you like troubleshooting steps or example code?
@RunLLM here is my code
```python
@MULTIMODAL_REGISTRY.register_processor(
    FunASRMultiModalProcessor,
    info=FunASRProcessingInfo,
    dummy_inputs=FunASRDummyInputsBuilder,
)
class FunASRForConditionalGeneration(
    nn.Module,
    SupportsTranscription,
    SupportsMultiModal,
):
    packed_modules_mapping = {
        "self_attn.qkv_proj": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj",
        ],
        "encoder_attn.kv_proj": ["encoder_attn.k_proj", "encoder_attn.v_proj"],
    }
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={
            "linear_q.": "q_proj.",
            "linear_k.": "k_proj.",
            "linear_v.": "v_proj.",
            "linear_out.": "out_proj.",
        }
    )
    supports_transcription_only = True
    supports_segment_timestamp = True
    supported_languages = ISO639_1_SUPPORTED_LANGS

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            # TODO language should be optional and can be guessed.
            # For now we default to en. See
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        return super().validate_language(language)

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        model_config: ModelConfig,  # not needed here
        stt_config: SpeechToTextConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        # processor = cached_processor_from_config(model_config)
        if language is None:
            raise ValueError(
                "Language must be specified when creating the funasr prompt"
            )
        funasr_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n语音转写:<|AUDIO|><|im_end|>\n<|im_start|>assistant\n"  # noqa: E501
        prompt = {
            "prompt": funasr_prompt,
            "multi_modal_data": {
                "audio": (audio, stt_config.sample_rate),
            },
        }
        return cast(PromptType, prompt)

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: str
    ) -> SpeechToTextConfig:
        processor = cached_processor_from_config(model_config)
        return SpeechToTextConfig(
            max_audio_clip_s=processor.feature_extractor.chunk_length,
            sample_rate=processor.feature_extractor.sampling_rate,
        )

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        processor = cached_processor_from_config(model_config)
        hop_length = processor.feature_extractor.hop_length
        assert hop_length is not None
        return math.ceil(audio_duration_s * stt_config.sample_rate / hop_length)

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        self.config = config
        self.dtype = vllm_config.model_config.dtype
        self.model = FunASRModel(
            vllm_config=vllm_config,
            prefix=maybe_prefix(prefix, "model"),
        )
        logit_scale = getattr(config, "logit_scale", 1.0)
        if config.tie_word_embeddings:
            self.lm_head = self.model.decoder.embed_tokens
        else:
            self.lm_head = ParallelLMHead(
                config.vocab_size,
                config.hidden_size,
                quant_config=quant_config,
                prefix=maybe_prefix(prefix, "lm_head"),
            )
        self.logits_processor = LogitsProcessor(config.vocab_size, scale=logit_scale)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs,
    ) -> torch.Tensor:
        decoder_outputs = self.model(
            input_ids=input_ids,
            positions=positions,
            inputs_embeds=inputs_embeds,
        )
        return decoder_outputs

    def get_language_model(self) -> torch.nn.Module:
        return self.model.decoder

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        audio_input = self._parse_and_validate_audio_input(**kwargs)
        speech = audio_input["input_features"]
        speech_lengths = audio_input["speech_lengths"]
        enc_output = self.model.get_encoder_outputs(
            speech=speech, speech_lengths=speech_lengths
        )
        return enc_output

    def embed_input_ids(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: torch.Tensor | None = None,
        handle_oov_mm_token: bool = False,
    ) -> torch.Tensor:
        inputs_embeds = self.model.decoder.embed_input_ids(input_ids)
        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )

    def _parse_and_validate_audio_input(self, **kwargs: object) -> FunASRAudioInputs:
        input_features = kwargs.pop("input_features", None)
        speech_lengths = kwargs.pop("speech_lengths", None)
        if input_features is not None:
            input_features = json_map_leaves(lambda x: x.to(self.dtype), input_features)
        if speech_lengths is not None:
            speech_lengths = json_map_leaves(lambda x: x.to(self.dtype), speech_lengths)
        return FunASRAudioInputs(
            input_features=input_features, speech_lengths=speech_lengths
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        # add fake zeros bias for k_proj to state_dict
        weights = _create_fake_bias_for_k_proj(weights)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
```
Your implementation of SupportsTranscription and the required methods appears correct, but if /v1/audio/transcriptions is still not exposed, the most common causes are:
- Model Registry: Ensure your model class (FunASRForConditionalGeneration) is actually registered in the vLLM model registry and is being loaded by the server. If the model is not discoverable, the endpoint will not be enabled (docs).
- Class Attributes: Double-check that supports_transcription = True is set as a class attribute (not just supports_transcription_only). This is required for vLLM to detect transcription support (API reference).
- Server Logs: Check the server logs at startup for warnings or errors about model registration, missing attributes, or interface issues. If the model is not listed as supporting transcription, the endpoint will not be exposed.
- Supported Languages: Make sure supported_languages is a valid mapping of ISO 639-1 codes to language names, and that all codes are valid (docs).

If all of the above are correct and the endpoint is still not available, please provide the server startup logs for further diagnosis. Would you like a checklist or example of a minimal working model class?
@RunLLM still not work. how can I check if my model supports transcription during startup
To check if your model supports transcription during startup, look for log messages indicating that the model supports the “transcription” task or that the /v1/audio/transcriptions endpoint is enabled. vLLM logs the supported tasks for each model at startup; if “transcription” is not listed, your model is not recognized as supporting it (docs). Also, you can programmatically check with supports_transcription(model) in a Python shell (API reference).
If you do not see “transcription” in the logs or the endpoint is missing, your model is not correctly registered or does not fully implement the required interface. Would you like instructions on how to add a runtime check or interpret the logs?
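The programmatic check can be mimicked with a small duck-typed helper — a simplified stand-in for vLLM's actual supports_transcription() function (which lives in vllm.model_executor.models.interfaces), useful for sanity-checking your class in a plain Python shell:

```python
def model_supports_transcription(model_cls: type) -> bool:
    """Simplified stand-in for vLLM's supports_transcription() check:
    the class must opt in via the supports_transcription attribute and
    provide the interface's required classmethods."""
    required = ("get_speech_to_text_config", "get_generation_prompt")
    return bool(getattr(model_cls, "supports_transcription", False)) and all(
        callable(getattr(model_cls, name, None)) for name in required
    )


class NoASR:
    # No supports_transcription attribute at all -> endpoint stays disabled.
    pass


class WithASR:
    supports_transcription = True

    @classmethod
    def get_speech_to_text_config(cls, model_config, task_type): ...

    @classmethod
    def get_generation_prompt(cls, audio, **kwargs): ...
```

In a real deployment, prefer importing and calling vLLM's own helper against your loaded model class, since it performs the authoritative interface check.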
@RunLLM does vllm support deploying multiple embedding models on a single GPU? What's the command?
vLLM does not support serving multiple embedding models on a single GPU within a single vLLM instance. The recommended approach is to run multiple vLLM server instances (each on a different port), each serving a different model, and use a routing layer to direct requests to the correct server. There is no command to load multiple models in one vLLM process (FAQ, FAQ2).
If you want to run multiple vLLM instances on the same GPU, you can try setting the --gpu-memory-utilization flag to allocate a fraction of GPU memory per instance, but this is not always reliable and may require manual tuning and staggered startup (discussion). Would you like more detail or example commands?
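The routing layer can be as small as a client-side lookup table. A minimal sketch, assuming two hypothetical instances on ports 8000 and 8001 (the model names and ports here are placeholders, not a recommendation):

```python
# Hypothetical setup: one vLLM server per embedding model, e.g. started as
#   vllm serve <embedding-model-A> --port 8000 --gpu-memory-utilization 0.4
#   vllm serve <embedding-model-B> --port 8001 --gpu-memory-utilization 0.4
# The client then picks the base URL by model name and calls the usual
# OpenAI-compatible /v1/embeddings endpoint on that instance.
MODEL_ROUTES = {
    "embedding-model-a": "http://localhost:8000/v1",
    "embedding-model-b": "http://localhost:8001/v1",
}


def base_url_for(model: str) -> str:
    """Return the base URL of the vLLM instance serving the given model."""
    try:
        return MODEL_ROUTES[model]
    except KeyError:
        raise ValueError(f"No vLLM instance configured for model {model!r}")
```

Each request then goes to `base_url_for(model)` with an ordinary OpenAI-compatible client; the two server processes share the GPU only via the fractional `--gpu-memory-utilization` budget mentioned above.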
@RunLLM does pooling model have cudagraph and torch compile support? since the input token is not fixed
Pooling models do not support full cudagraphs in vLLM; if you attempt to enable full cudagraph mode, vLLM will override it to PIECEWISE mode (docs). Torch.compile is supported, but dynamic input tokens mean only piecewise cudagraphs are used, not full cudagraphs.
Would you like more detail on how vLLM handles pooling models and dynamic shapes?
@RunLLM does it support fp32 attention backend if I want to keep accuracy
vLLM supports FP32 (float32) model weights, but most optimized attention backends (like FlashAttention) do not support FP32; instead, vLLM falls back to less optimized backends such as XFormers or Torch SDPA for FP32 (issue, docs). Using FP32 may reduce performance and is generally not recommended unless you specifically require maximum numerical accuracy.
Would you like more detail on backend selection or how to configure FP32 in vLLM?
@RunLLM more detail on how vLLM handles pooling models and dynamic shapes
Certainly! Here's a step-by-step explanation:
- Pooling Models and Dynamic Input: Pooling models (e.g., embedding models) in vLLM accept variable-length input tokens, so the input sequence length is not fixed. This dynamic shape means that the computation graph must handle a range of input sizes.
- torch.compile Support: vLLM supports torch.compile for pooling models, which allows for graph-level optimizations. However, because the input token length varies, the compiled graph must be dynamic with respect to sequence length. vLLM uses dynamic shape guards to ensure the compiled graph is valid for all input lengths, but if user/model code branches on sequence length, it can cause graph breaks or constraint violations (dynamic shapes docs).
- CUDAGraph Support: Full cudagraph capture is not supported for pooling models due to their variable input shapes. If you try to enable full cudagraph mode, vLLM will automatically override it to PIECEWISE mode, which only captures cudagraphs for submodules that can handle dynamic shapes (config docs). This ensures compatibility but may not provide the same performance boost as full cudagraphs.
- Piecewise CUDAGraph: In PIECEWISE mode, vLLM splits the computation graph at attention or other incompatible ops, capturing cudagraphs for the compatible segments. This allows pooling models to benefit from cudagraph acceleration where possible, while still supporting dynamic input lengths (torch_compile docs).
- Best Practices: For optimal performance and compatibility, avoid branching on input sequence length in your model code, and use PIECEWISE cudagraph mode (the default for pooling models). If you need to debug dynamic shape issues, use stricter dynamic shape modes as described in the debugging guide.
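Conceptually, piecewise capture splits the model's op sequence at the cudagraph-incompatible ops (in practice, attention with dynamic sequence lengths) and only captures the static segments in between. A toy illustration of that splitting — not actual vLLM code:

```python
def split_piecewise(ops: list[str], incompatible: set[str]) -> list[list[str]]:
    """Toy illustration of piecewise cudagraph capture: contiguous runs of
    capturable ops become cudagraph segments, while incompatible ops
    (e.g. attention over dynamic sequence lengths) run eagerly between them."""
    segments: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if op in incompatible:
            # Close off the current capturable segment; the incompatible
            # op itself is executed outside any cudagraph.
            if current:
                segments.append(current)
                current = []
        else:
            current.append(op)
    if current:
        segments.append(current)
    return segments


# A simplified transformer-with-pooling op sequence:
layer_ops = ["embed", "norm", "attention", "mlp", "norm", "attention", "mlp", "pool"]
segments = split_piecewise(layer_ops, {"attention"})
# Three capturable segments, with eager attention in between.
```

The real mechanism operates on the compiled FX graph rather than op-name strings, but the segmentation idea is the same: the static segments get cudagraph acceleration, and the dynamic-shape ops stay eager.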
@RunLLM so vllm will padding tokens to certain shape if it is pooling model?
No, for pooling models, vLLM does not pad input tokens to a fixed shape for cudagraph capture. Instead, it uses PIECEWISE cudagraph mode, which allows for dynamic input lengths and only captures cudagraphs for compatible submodules (config docs). Padding to fixed shapes is not performed for pooling models, so each request can have variable sequence lengths.
Would you like more detail on how PIECEWISE mode works with dynamic shapes?
Sources:
- https://github.com/vllm-project/vllm/issues/28207
- https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes
@RunLLM how PIECEWISE mode works with pooling model dynamic shapes?