how to expose v1/audio/transcriptions router for custom models
To expose the /v1/audio/transcriptions endpoint for a custom model in vLLM, your model must implement the SupportsTranscription interface and be registered in the model registry. You need to define required class attributes and methods such as supported_languages, supports_transcription_only, get_speech_to_text_config, and get_generation_prompt. Once these are implemented, vLLM will automatically advertise transcription support and initialize the endpoint—no extra registration is needed beyond this interface implementation. See docs and example code.
If your model does not implement SupportsTranscription, the endpoint will return an error stating the model does not support the Transcriptions API. For multimodal models, follow the multimodal prompt construction pattern as described in the documentation. Would you like a step-by-step code example?
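As a rough sketch of the shape involved — purely illustrative, with the vllm types (the real `SupportsTranscription` base class, `SpeechToTextConfig`, `PromptType`) replaced by plain stand-ins so the snippet is self-contained:

```python
# Illustrative skeleton only. In real code, SupportsTranscription comes from
# vllm.model_executor.models.interfaces and the config/prompt types from vllm;
# stand-ins are used here so the sketch runs without vLLM installed.
from typing import Any


class SupportsTranscription:  # stand-in for vLLM's interface
    supports_transcription: bool = True


class MyASRModel(SupportsTranscription):
    # Class attributes vLLM inspects when deciding whether to enable
    # the /v1/audio/transcriptions endpoint for this model.
    supports_transcription = True
    supports_transcription_only = True  # model does not do plain text generation
    supported_languages = {"en": "English", "zh": "Chinese"}

    @classmethod
    def get_speech_to_text_config(cls, model_config: Any, task_type: str) -> dict:
        # Real implementations return a SpeechToTextConfig; a dict stands in here.
        return {"sample_rate": 16000, "max_audio_clip_s": 30}

    @classmethod
    def get_generation_prompt(
        cls, audio: Any, stt_config: dict, language: Any, **kwargs: Any
    ) -> dict:
        # Real implementations return a PromptType with the audio attached
        # as multi_modal_data.
        return {
            "prompt": "<transcribe>",
            "multi_modal_data": {"audio": (audio, stt_config["sample_rate"])},
        }
```

The attribute and method names follow the interface described above; the bodies are placeholders, not a working model.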
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM adding SupportsTranscription and required functions still don’t work
If you’ve added SupportsTranscription and the required functions but /v1/audio/transcriptions still doesn’t work, ensure your model is properly registered in the vLLM model registry and that all required class attributes and methods are correctly implemented. Also, verify that your server is running with the correct model and that the model advertises transcription support (i.e., supports_transcription = True). If these are set, vLLM should automatically expose the endpoint—no manual router registration is needed. See docs.
If the endpoint is still missing or returns an error, check server logs for import or registration issues, and confirm your model is listed in the supported models for transcription. Some models may require additional configuration or dependencies (e.g., audio libraries). Would you like troubleshooting steps or example code?
@RunLLM here is my code
```python
@MULTIMODAL_REGISTRY.register_processor(
    FunASRMultiModalProcessor,
    info=FunASRProcessingInfo,
    dummy_inputs=FunASRDummyInputsBuilder,
)
class FunASRForConditionalGeneration(
    nn.Module,
    SupportsTranscription,
    SupportsMultiModal,
):
    packed_modules_mapping = {
        "self_attn.qkv_proj": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj",
        ],
        "encoder_attn.kv_proj": ["encoder_attn.k_proj", "encoder_attn.v_proj"],
    }
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={
            "linear_q.": "q_proj.",
            "linear_k.": "k_proj.",
            "linear_v.": "v_proj.",
            "linear_out.": "out_proj.",
        }
    )
    supports_transcription_only = True
    supports_segment_timestamp = True
    supported_languages = ISO639_1_SUPPORTED_LANGS

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            # TODO language should be optional and can be guessed.
            # For now we default to en. See
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        return super().validate_language(language)

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        model_config: ModelConfig,  # not needed here
        stt_config: SpeechToTextConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        # processor = cached_processor_from_config(model_config)
        if language is None:
            raise ValueError(
                "Language must be specified when creating the funasr prompt"
            )
        funasr_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n语音转写:<|AUDIO|><|im_end|>\n<|im_start|>assistant\n"  # noqa: E501
        prompt = {
            "prompt": funasr_prompt,
            "multi_modal_data": {
                "audio": (audio, stt_config.sample_rate),
            },
        }
        return cast(PromptType, prompt)

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: str
    ) -> SpeechToTextConfig:
        processor = cached_processor_from_config(model_config)
        return SpeechToTextConfig(
            max_audio_clip_s=processor.feature_extractor.chunk_length,
            sample_rate=processor.feature_extractor.sampling_rate,
        )

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        processor = cached_processor_from_config(model_config)
        hop_length = processor.feature_extractor.hop_length
        assert hop_length is not None
        return math.ceil(audio_duration_s * stt_config.sample_rate / hop_length)

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        self.config = config
        self.dtype = vllm_config.model_config.dtype
        self.model = FunASRModel(
            vllm_config=vllm_config,
            prefix=maybe_prefix(prefix, "model"),
        )
        logit_scale = getattr(config, "logit_scale", 1.0)
        if config.tie_word_embeddings:
            self.lm_head = self.model.decoder.embed_tokens
        else:
            self.lm_head = ParallelLMHead(
                config.vocab_size,
                config.hidden_size,
                quant_config=quant_config,
                prefix=maybe_prefix(prefix, "lm_head"),
            )
        self.logits_processor = LogitsProcessor(config.vocab_size, scale=logit_scale)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs,
    ) -> torch.Tensor:
        decoder_outputs = self.model(
            input_ids=input_ids,
            positions=positions,
            inputs_embeds=inputs_embeds,
        )
        return decoder_outputs

    def get_language_model(self) -> torch.nn.Module:
        return self.model.decoder

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        audio_input = self._parse_and_validate_audio_input(**kwargs)
        speech = audio_input["input_features"]
        speech_lengths = audio_input["speech_lengths"]
        enc_output = self.model.get_encoder_outputs(
            speech=speech, speech_lengths=speech_lengths
        )
        return enc_output

    def embed_input_ids(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: torch.Tensor | None = None,
        handle_oov_mm_token: bool = False,
    ) -> torch.Tensor:
        inputs_embeds = self.model.decoder.embed_input_ids(input_ids)
        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )

    def _parse_and_validate_audio_input(self, **kwargs: object) -> FunASRAudioInputs:
        input_features = kwargs.pop("input_features", None)
        speech_lengths = kwargs.pop("speech_lengths", None)
        if input_features is not None:
            input_features = json_map_leaves(lambda x: x.to(self.dtype), input_features)
        if speech_lengths is not None:
            speech_lengths = json_map_leaves(lambda x: x.to(self.dtype), speech_lengths)
        return FunASRAudioInputs(
            input_features=input_features, speech_lengths=speech_lengths
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        # add fake zeros bias for k_proj to state_dict
        weights = _create_fake_bias_for_k_proj(weights)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
```
Your implementation of SupportsTranscription and the required methods appears correct, but if /v1/audio/transcriptions is still not exposed, the most common causes are:
- Model Registry: Ensure your model class (FunASRForConditionalGeneration) is actually registered in the vLLM model registry and is being loaded by the server. If the model is not discoverable, the endpoint will not be enabled (docs).
- Class Attributes: Double-check that supports_transcription = True is set as a class attribute (not just supports_transcription_only). This is required for vLLM to detect transcription support (API reference).
- Server Logs: Check the server logs at startup for warnings or errors about model registration, missing attributes, or interface issues. If the model is not listed as supporting transcription, the endpoint will not be exposed.
- Supported Languages: Make sure supported_languages is a valid mapping of ISO 639-1 codes to language names, and that all codes are valid (docs).

If all of the above are correct and the endpoint is still not available, please provide the server startup logs for further diagnosis. Would you like a checklist or example of a minimal working model class?
@RunLLM still not work. how can I check if my model supports transcription during startup
To check if your model supports transcription during startup, look for log messages indicating that the model supports the “transcription” task or that the /v1/audio/transcriptions endpoint is enabled. vLLM logs the supported tasks for each model at startup; if “transcription” is not listed, your model is not recognized as supporting it (docs). Also, you can programmatically check with supports_transcription(model) in a Python shell (API reference).
If you do not see “transcription” in the logs or the endpoint is missing, your model is not correctly registered or does not fully implement the required interface. Would you like instructions on how to add a runtime check or interpret the logs?
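The programmatic check can be mimicked with a small duck-typed helper — a simplified stand-in for vLLM's actual supports_transcription() function (which lives in vllm.model_executor.models.interfaces), useful for sanity-checking your class in a plain Python shell:

```python
def model_supports_transcription(model_cls: type) -> bool:
    """Simplified stand-in for vLLM's supports_transcription() check:
    the class must opt in via the supports_transcription attribute and
    provide the interface's required classmethods."""
    required = ("get_speech_to_text_config", "get_generation_prompt")
    return bool(getattr(model_cls, "supports_transcription", False)) and all(
        callable(getattr(model_cls, name, None)) for name in required
    )


class NoASR:
    # No supports_transcription attribute at all -> endpoint stays disabled.
    pass


class WithASR:
    supports_transcription = True

    @classmethod
    def get_speech_to_text_config(cls, model_config, task_type): ...

    @classmethod
    def get_generation_prompt(cls, audio, **kwargs): ...
```

In a real deployment, prefer importing and calling vLLM's own helper against your loaded model class, since it performs the authoritative interface check.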
@RunLLM does vllm support deploying multiple embedding models on a single GPU? What's the command?
vLLM does not support serving multiple embedding models on a single GPU within a single vLLM instance. The recommended approach is to run multiple vLLM server instances (each on a different port), each serving a different model, and use a routing layer to direct requests to the correct server. There is no command to load multiple models in one vLLM process (FAQ, FAQ2).
If you want to run multiple vLLM instances on the same GPU, you can try setting the --gpu-memory-utilization flag to allocate a fraction of GPU memory per instance, but this is not always reliable and may require manual tuning and staggered startup (discussion). Would you like more detail or example commands?
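The routing layer can be as small as a client-side lookup table. A minimal sketch, assuming two hypothetical instances on ports 8000 and 8001 (the model names and ports here are placeholders, not a recommendation):

```python
# Hypothetical setup: one vLLM server per embedding model, e.g. started as
#   vllm serve <embedding-model-A> --port 8000 --gpu-memory-utilization 0.4
#   vllm serve <embedding-model-B> --port 8001 --gpu-memory-utilization 0.4
# The client then picks the base URL by model name and calls the usual
# OpenAI-compatible /v1/embeddings endpoint on that instance.
MODEL_ROUTES = {
    "embedding-model-a": "http://localhost:8000/v1",
    "embedding-model-b": "http://localhost:8001/v1",
}


def base_url_for(model: str) -> str:
    """Return the base URL of the vLLM instance serving the given model."""
    try:
        return MODEL_ROUTES[model]
    except KeyError:
        raise ValueError(f"No vLLM instance configured for model {model!r}")
```

Each request then goes to `base_url_for(model)` with an ordinary OpenAI-compatible client; the two server processes share the GPU only via the fractional `--gpu-memory-utilization` budget mentioned above.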
@RunLLM does pooling model have cudagraph and torch compile support? since the input token is not fixed
Pooling models do not support full cudagraphs in vLLM; if you attempt to enable full cudagraph mode, vLLM will override it to PIECEWISE mode (docs). Torch.compile is supported, but dynamic input tokens mean only piecewise cudagraphs are used, not full cudagraphs.
Would you like more detail on how vLLM handles pooling models and dynamic shapes?
@RunLLM does it support fp32 attention backend if I want to keep accuracy
vLLM supports FP32 (float32) model weights, but most optimized attention backends (like FlashAttention) do not support FP32; instead, vLLM falls back to less optimized backends such as XFormers or Torch SDPA for FP32 (issue, docs). Using FP32 may reduce performance and is generally not recommended unless you specifically require maximum numerical accuracy.
Would you like more detail on backend selection or how to configure FP32 in vLLM?
@RunLLM more detail on how vLLM handles pooling models and dynamic shapes
Certainly! Here's a step-by-step explanation:
- Pooling Models and Dynamic Input: Pooling models (e.g., embedding models) in vLLM accept variable-length input tokens, so the input sequence length is not fixed. This dynamic shape means that the computation graph must handle a range of input sizes.
- torch.compile Support: vLLM supports torch.compile for pooling models, which allows for graph-level optimizations. However, because the input token length varies, the compiled graph must be dynamic with respect to sequence length. vLLM uses dynamic shape guards to ensure the compiled graph is valid for all input lengths, but if user/model code branches on sequence length, it can cause graph breaks or constraint violations (dynamic shapes docs).
- CUDAGraph Support: Full cudagraph capture is not supported for pooling models due to their variable input shapes. If you try to enable full cudagraph mode, vLLM will automatically override it to PIECEWISE mode, which only captures cudagraphs for submodules that can handle dynamic shapes (config docs). This ensures compatibility but may not provide the same performance boost as full cudagraphs.
- Piecewise CUDAGraph: In PIECEWISE mode, vLLM splits the computation graph at attention or other incompatible ops, capturing cudagraphs for the compatible segments. This allows pooling models to benefit from cudagraph acceleration where possible, while still supporting dynamic input lengths (torch_compile docs).
- Best Practices: For optimal performance and compatibility, avoid branching on input sequence length in your model code, and use PIECEWISE cudagraph mode (the default for pooling models). If you need to debug dynamic shape issues, use stricter dynamic shape modes as described in the debugging guide.
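Conceptually, piecewise capture splits the model's op sequence at the cudagraph-incompatible ops (in practice, attention with dynamic sequence lengths) and only captures the static segments in between. A toy illustration of that splitting — not actual vLLM code:

```python
def split_piecewise(ops: list[str], incompatible: set[str]) -> list[list[str]]:
    """Toy illustration of piecewise cudagraph capture: contiguous runs of
    capturable ops become cudagraph segments, while incompatible ops
    (e.g. attention over dynamic sequence lengths) run eagerly between them."""
    segments: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if op in incompatible:
            # Close off the current capturable segment; the incompatible
            # op itself is executed outside any cudagraph.
            if current:
                segments.append(current)
                current = []
        else:
            current.append(op)
    if current:
        segments.append(current)
    return segments


# A simplified transformer-with-pooling op sequence:
layer_ops = ["embed", "norm", "attention", "mlp", "norm", "attention", "mlp", "pool"]
segments = split_piecewise(layer_ops, {"attention"})
# Three capturable segments, with eager attention in between.
```

The real mechanism operates on the compiled FX graph rather than op-name strings, but the segmentation idea is the same: the static segments get cudagraph acceleration, and the dynamic-shape ops stay eager.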
@RunLLM so vllm will padding tokens to certain shape if it is pooling model?
No, for pooling models, vLLM does not pad input tokens to a fixed shape for cudagraph capture. Instead, it uses PIECEWISE cudagraph mode, which allows for dynamic input lengths and only captures cudagraphs for compatible submodules (config docs). Padding to fixed shapes is not performed for pooling models, so each request can have variable sequence lengths.
Would you like more detail on how PIECEWISE mode works with dynamic shapes?
Sources:
- https://github.com/vllm-project/vllm/issues/28207
- https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes
@RunLLM how PIECEWISE mode works with pooling model dynamic shapes?