How to expose v1/audio/transcriptions router for custom models

It seems likely that in PIECEWISE mode, vLLM splits the computation graph at attention or other cudagraph-incompatible operations, capturing cudagraphs only for the submodules that can handle dynamic shapes. For pooling models, which accept variable-length input tokens, this means cudagraphs are captured for compatible segments, while attention or pooling logic that depends on dynamic input remains in eager mode (torch_compile docs, cuda_graphs design).

This approach allows pooling models to process dynamic input lengths efficiently, leveraging cudagraph acceleration where possible, but without requiring input padding or full static shapes. Would you like a step-by-step breakdown or a diagram reference?
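The piecewise idea above can be sketched in plain Python. This is a conceptual illustration only (the class and names below are invented for this sketch, not vLLM internals): shape-friendly segments are "captured" once per (segment, shape), while the incompatible op between them always runs eagerly.

```python
# Conceptual sketch of PIECEWISE execution (plain Python, no CUDA).
# Segments are cudagraph-eligible pieces; eager_ops (e.g. attention)
# stay in eager mode between them. Illustrative names only.

class PiecewiseRunner:
    def __init__(self, segments, eager_ops):
        self.segments = segments          # cudagraph-capturable callables
        self.eager_ops = eager_ops        # dynamic ops that run eagerly
        self.captured = {}                # (segment_idx, shape) -> callable

    def _run_segment(self, i, x):
        key = (i, len(x))                 # a captured graph is fixed to one shape
        if key not in self.captured:      # "capture" once per (segment, shape)
            self.captured[key] = self.segments[i]
        return self.captured[key](x)

    def forward(self, x):
        for i, eager in enumerate(self.eager_ops):
            x = self._run_segment(i, x)   # cudagraph-eligible piece
            x = eager(x)                  # e.g. attention: stays eager
        return self._run_segment(len(self.eager_ops), x)

runner = PiecewiseRunner(
    segments=[lambda x: [v * 2 for v in x],    # pre-attention piece
              lambda x: [v + 1 for v in x]],   # post-attention piece
    eager_ops=[lambda x: x[::-1]],             # "attention": dynamic, eager
)
print(runner.forward([1, 2, 3]))   # [7, 5, 3]
```

The key point the sketch shows: only the segments are shape-specialized; the dynamic op in the middle never needs a fixed shape.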

Hint: Mention @RunLLM in the post for followups.

@RunLLM cudagraph needs input shape to be fixed, right? Why can piecewise support dynamic shapes?

Yes, CUDA Graphs require fixed input shapes for each captured graph. PIECEWISE mode in vLLM works with dynamic shapes by capturing separate cudagraphs for each unique input shape (e.g., batch size or sequence length) encountered at runtime, but only for submodules that are compatible with cudagraphs (torch_compile docs, cuda_graphs design).

For pooling models, vLLM does not pad inputs to a fixed shape. Instead, it manages a cache of cudagraphs for the specific shapes seen so far, replaying the appropriate cudagraph when the same shape is encountered again. If a new shape appears, vLLM falls back to eager execution or captures a new cudagraph for that shape. This allows efficient execution for dynamic input lengths without requiring all inputs to be padded or fixed.
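The shape-keyed caching described above can be illustrated with a small plain-Python sketch. Everything here is invented for illustration (no CUDA, not vLLM's actual classes); it only shows the dispatch logic: look up a graph by shape, capture on first sight, replay thereafter.

```python
# Conceptual sketch (plain Python): reusing fixed-shape "captured graphs"
# for dynamic workloads by keying a cache on the input shape.

class GraphDispatcher:
    def __init__(self, capture_fn):
        self.capture_fn = capture_fn      # captures a graph for one shape
        self.graphs = {}                  # shape -> captured "graph"

    def run(self, tokens):
        shape = len(tokens)               # the key a CUDA graph is fixed to
        if shape not in self.graphs:
            # First time this shape is seen: capture a graph for it
            # (a real runtime could instead fall back to eager here).
            self.graphs[shape] = self.capture_fn(shape)
        # Replay the graph captured for exactly this shape.
        return self.graphs[shape](tokens)

def capture(shape):
    # Returns a callable specialized to one fixed length, mimicking
    # how a CUDA graph replays only the shape it was captured with.
    def replay(tokens):
        assert len(tokens) == shape
        return sum(tokens)
    return replay

dispatcher = GraphDispatcher(capture)
print(dispatcher.run([1, 2, 3]))   # captures for shape 3 -> 6
print(dispatcher.run([4, 5, 6]))   # replays cached shape-3 graph -> 15
print(len(dispatcher.graphs))      # 1 distinct shape captured so far
```

A second input of a new length would trigger one more capture, after which both shapes replay from the cache.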

Would you like a step-by-step explanation or a diagram reference?

@RunLLM how do embedding models remove padding when batch size > 1?

vLLM embedding (pooling) models handle variable-length inputs in a batch without requiring explicit padding: each input sequence is processed according to its true length, and any padding tokens that do appear are ignored during pooling (e.g., MEAN, CLS, or LAST pooling). The pooling logic uses attention masks or sequence lengths to exclude padding tokens from the final embedding calculation (docs, PR discussion).

So, when batch size > 1, vLLM internally tracks the actual length of each sequence and ensures that only valid (non-padding) tokens contribute to the embedding. No explicit user-side padding removal is needed.
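A minimal sketch of the mask/length-based exclusion described above, in plain Python (vLLM's real pooling layers operate on GPU tensors; `masked_mean_pool` is an invented name for illustration):

```python
# Mean pooling that uses per-sequence true lengths so padded positions
# never contribute to the embedding. Illustrative sketch only.

def masked_mean_pool(batch_hidden, seq_lens):
    """batch_hidden: [batch][max_len][dim] with padded rows; seq_lens: true lengths."""
    pooled = []
    for hidden, n in zip(batch_hidden, seq_lens):
        valid = hidden[:n]                 # keep only the n real token rows
        dim = len(valid[0])
        pooled.append([sum(tok[d] for tok in valid) / n for d in range(dim)])
    return pooled

# Two sequences padded to length 3; the second has one padding row.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]],   # last row is padding
]
print(masked_mean_pool(batch, [3, 2]))
# [[3.0, 4.0], [3.0, 3.0]] -- the padding row never enters the second mean
```

Note the second sequence's mean divides by 2 (its true length), not 3, so the zero padding row neither dilutes nor shifts the embedding.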

@RunLLM more details, and do embedding models support prefix caching?

vLLM embedding (pooling) models do not currently support prefix caching. Prefix caching is designed for generation tasks and is not enabled for pooling/embedding endpoints, as embeddings require the full prompt’s hidden states and cannot reuse partial computation (docs, issue). When you use the pooling/embedding API, vLLM always recomputes the entire input sequence.

Would you like more detail on why prefix caching is not supported for embeddings?

@RunLLM more details on how vLLM handles dynamic input token lengths for embedding models

vLLM handles dynamic input token lengths for embedding (pooling) models by supporting variable-length sequences within each batch—there is no requirement for all inputs to be padded to the same length. Internally, vLLM tracks the true length of each input and uses attention masks or sequence length metadata to ensure that only valid (non-padding) tokens are included in the embedding calculation (docs, PR discussion).

This design allows efficient batching and avoids unnecessary computation on padding tokens. The pooling operation (e.g., MEAN, CLS, LAST) is applied only to the actual tokens, and chunked processing is used for very long inputs, further supporting flexible input lengths without loss of accuracy.
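A plain-Python sketch of padding-free batching in the spirit described above: sequences are concatenated into one flat token stream and pooled per sequence via their true lengths. The function name and layout are invented for illustration and are not vLLM's actual packed representation.

```python
# Pool a packed (concatenated, padding-free) batch of hidden states.
# seq_lens gives each sequence's true length; mode selects the pooling type.

def pool_packed(flat_hidden, seq_lens, mode="MEAN"):
    pooled, start = [], 0
    for n in seq_lens:
        seq = flat_hidden[start:start + n]   # this sequence's token rows
        if mode == "CLS":
            pooled.append(seq[0])            # first token only
        elif mode == "LAST":
            pooled.append(seq[-1])           # last token only
        else:                                # MEAN over the true length
            dim = len(seq[0])
            pooled.append([sum(t[d] for t in seq) / n for d in range(dim)])
        start += n
    return pooled

# Three sequences of lengths 2, 1, 3 packed back to back -- no padding anywhere.
flat = [[1.0], [3.0], [10.0], [2.0], [4.0], [6.0]]
print(pool_packed(flat, [2, 1, 3], "MEAN"))   # [[2.0], [10.0], [4.0]]
print(pool_packed(flat, [2, 1, 3], "LAST"))   # [[3.0], [10.0], [6.0]]
```

Because the batch is packed, no compute is spent on padding tokens at all, which is the efficiency win the answer above describes.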

@RunLLM will chunked input affect accuracy?

Chunked input can slightly affect embedding accuracy, especially for models not designed for long contexts or when using pooling types other than MEAN. However, benchmarks show that for many models (e.g., multilingual-e5-large, jina-embeddings-v3), chunked processing maintains comparable performance to native long-context support as long as the context extension is not extreme (PR discussion).
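One way to see why MEAN pooling tolerates chunking well: aggregating chunk means weighted by chunk length reproduces the full-sequence mean exactly, so any accuracy change comes from the model's hidden states differing across chunk boundaries, not from the pooling arithmetic. A small worked check:

```python
# Worked check: length-weighted aggregation of chunk means equals the
# full mean. Values chosen so the arithmetic is exact in floating point.

def mean(xs):
    return sum(xs) / len(xs)

tokens = [1.0, 2.0, 3.0, 3.0, 4.0, 7.0, 1.0]
chunks = [tokens[:3], tokens[3:5], tokens[5:]]   # chunk lengths 3, 2, 2

full = mean(tokens)
weighted = sum(len(c) * mean(c) for c in chunks) / len(tokens)
print(full, weighted)       # 3.0 3.0
print(full == weighted)     # True: chunked MEAN aggregation is exact
```

CLS and LAST pooling have no such exact decomposition across chunks, which is one reason the answer above singles out non-MEAN pooling types as more sensitive to chunking.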

Would you like to see benchmark results or more technical details?

@RunLLM without chunked processing, does vLLM need padding for variable-length input?

No, vLLM does not require padding for variable-length input when chunked processing is not used. vLLM natively supports batching of variable-length sequences for embedding (pooling) models, and each input is processed according to its true length without explicit padding (docs, PR discussion).

Would you like more detail on how vLLM batches variable-length inputs?

@RunLLM how to return input_id in the OpenAI response?

It seems likely that vLLM’s OpenAI-compatible API does not natively return input_id in the response for embedding or completion endpoints. The standard OpenAI API response does not include an input_id field, and there is no documented vLLM extension for this (OpenAI API reference, vLLM docs).

If you need to associate responses with input IDs, you may need to track them client-side or use the order of returned results. Would you like suggestions for workarounds?
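A sketch of the client-side workaround: OpenAI-style embeddings responses include an `index` field on each item matching input order, so you can join your own IDs to the results. `fake_response` below stands in for the `.data` of a real client call; `attach_ids` is a hypothetical helper, not part of any library.

```python
# Join caller-supplied IDs to embeddings using the response 'index' field,
# which is robust even if the response items arrive out of order.

def attach_ids(ids, response_data):
    """ids: caller IDs in the order inputs were sent; response_data: list of
    dicts with 'index' and 'embedding' keys, as in an OpenAI-style response."""
    by_index = {item["index"]: item["embedding"] for item in response_data}
    return {id_: by_index[i] for i, id_ in enumerate(ids)}

ids = ["doc-a", "doc-b"]
fake_response = [                  # stand-in for client.embeddings.create(...).data
    {"index": 1, "embedding": [0.3, 0.4]},
    {"index": 0, "embedding": [0.1, 0.2]},
]
print(attach_ids(ids, fake_response))
# {'doc-a': [0.1, 0.2], 'doc-b': [0.3, 0.4]}
```

Keying on `index` rather than list position protects against any reordering between request and response.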
