How to expose v1/audio/transcriptions router for custom models

It seems likely that in PIECEWISE mode, vLLM splits the computation graph at attention or other cudagraph-incompatible operations, capturing cudagraphs only for the submodules that can handle dynamic shapes. For pooling models, which accept variable-length input tokens, this means cudagraphs are captured for compatible segments, while attention or pooling logic that depends on dynamic input remains in eager mode (torch_compile docs, cuda_graphs design).

This approach allows pooling models to process dynamic input lengths efficiently, leveraging cudagraph acceleration where possible, but without requiring input padding or full static shapes. Would you like a step-by-step breakdown or a diagram reference?
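The piecewise idea above can be sketched in plain Python. This is a conceptual illustration only (the class and names below are invented for this sketch, not vLLM internals): shape-friendly segments are "captured" once per (segment, shape), while the incompatible op between them always runs eagerly.

```python
# Conceptual sketch of PIECEWISE execution (plain Python, no CUDA).
# Segments are cudagraph-eligible pieces; eager_ops (e.g. attention)
# stay in eager mode between them. Illustrative names only.

class PiecewiseRunner:
    def __init__(self, segments, eager_ops):
        self.segments = segments          # cudagraph-capturable callables
        self.eager_ops = eager_ops        # dynamic ops that run eagerly
        self.captured = {}                # (segment_idx, shape) -> callable

    def _run_segment(self, i, x):
        key = (i, len(x))                 # a captured graph is fixed to one shape
        if key not in self.captured:      # "capture" once per (segment, shape)
            self.captured[key] = self.segments[i]
        return self.captured[key](x)

    def forward(self, x):
        for i, eager in enumerate(self.eager_ops):
            x = self._run_segment(i, x)   # cudagraph-eligible piece
            x = eager(x)                  # e.g. attention: stays eager
        return self._run_segment(len(self.eager_ops), x)

runner = PiecewiseRunner(
    segments=[lambda x: [v * 2 for v in x],    # pre-attention piece
              lambda x: [v + 1 for v in x]],   # post-attention piece
    eager_ops=[lambda x: x[::-1]],             # "attention": dynamic, eager
)
print(runner.forward([1, 2, 3]))   # [7, 5, 3]
```

The key point the sketch shows: only the segments are shape-specialized; the dynamic op in the middle never needs a fixed shape.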

Hint: Mention @RunLLM in the post for followups.

@RunLLM cudagraph needs input shape to be fixed, right? Why can piecewise support dynamic shapes?

Yes, CUDA Graphs require fixed input shapes for each captured graph. PIECEWISE mode in vLLM works with dynamic shapes by capturing separate cudagraphs for each unique input shape (e.g., batch size or sequence length) encountered at runtime, but only for submodules that are compatible with cudagraphs (torch_compile docs, cuda_graphs design).

For pooling models, vLLM does not pad inputs to a fixed shape. Instead, it manages a cache of cudagraphs for the specific shapes seen so far, replaying the appropriate cudagraph when the same shape is encountered again. If a new shape appears, vLLM falls back to eager execution or captures a new cudagraph for that shape. This allows efficient execution for dynamic input lengths without requiring all inputs to be padded or fixed.
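The shape-keyed caching described above can be illustrated with a small plain-Python sketch. Everything here is invented for illustration (no CUDA, not vLLM's actual classes); it only shows the dispatch logic: look up a graph by shape, capture on first sight, replay thereafter.

```python
# Conceptual sketch (plain Python): reusing fixed-shape "captured graphs"
# for dynamic workloads by keying a cache on the input shape.

class GraphDispatcher:
    def __init__(self, capture_fn):
        self.capture_fn = capture_fn      # captures a graph for one shape
        self.graphs = {}                  # shape -> captured "graph"

    def run(self, tokens):
        shape = len(tokens)               # the key a CUDA graph is fixed to
        if shape not in self.graphs:
            # First time this shape is seen: capture a graph for it
            # (a real runtime could instead fall back to eager here).
            self.graphs[shape] = self.capture_fn(shape)
        # Replay the graph captured for exactly this shape.
        return self.graphs[shape](tokens)

def capture(shape):
    # Returns a callable specialized to one fixed length, mimicking
    # how a CUDA graph replays only the shape it was captured with.
    def replay(tokens):
        assert len(tokens) == shape
        return sum(tokens)
    return replay

dispatcher = GraphDispatcher(capture)
print(dispatcher.run([1, 2, 3]))   # captures for shape 3 -> 6
print(dispatcher.run([4, 5, 6]))   # replays cached shape-3 graph -> 15
print(len(dispatcher.graphs))      # 1 distinct shape captured so far
```

A second input of a new length would trigger one more capture, after which both shapes replay from the cache.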

Would you like a step-by-step explanation or a diagram reference?

@RunLLM how do embedding models remove padding when batch size > 1?

vLLM embedding (pooling) models handle variable-length inputs in a batch without requiring explicit padding: each input sequence is processed according to its true length, and any padding tokens that do appear are ignored during pooling (e.g., MEAN, CLS, or LAST pooling). The pooling logic uses attention masks or sequence lengths to exclude padding tokens from the final embedding calculation (docs, PR discussion).

So, when batch size > 1, vLLM internally tracks the actual length of each sequence and ensures that only valid (non-padding) tokens contribute to the embedding. No explicit user-side padding removal is needed.
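A minimal sketch of the mask/length-based exclusion described above, in plain Python (vLLM's real pooling layers operate on GPU tensors; `masked_mean_pool` is an invented name for illustration):

```python
# Mean pooling that uses per-sequence true lengths so padded positions
# never contribute to the embedding. Illustrative sketch only.

def masked_mean_pool(batch_hidden, seq_lens):
    """batch_hidden: [batch][max_len][dim] with padded rows; seq_lens: true lengths."""
    pooled = []
    for hidden, n in zip(batch_hidden, seq_lens):
        valid = hidden[:n]                 # keep only the n real token rows
        dim = len(valid[0])
        pooled.append([sum(tok[d] for tok in valid) / n for d in range(dim)])
    return pooled

# Two sequences padded to length 3; the second has one padding row.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]],   # last row is padding
]
print(masked_mean_pool(batch, [3, 2]))
# [[3.0, 4.0], [3.0, 3.0]] -- the padding row never enters the second mean
```

Note the second sequence's mean divides by 2 (its true length), not 3, so the zero padding row neither dilutes nor shifts the embedding.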

@RunLLM more details, and do embedding models support prefix caching?

vLLM embedding (pooling) models do not currently support prefix caching. Prefix caching is designed for generation tasks and is not enabled for pooling/embedding endpoints, as embeddings require the full prompt’s hidden states and cannot reuse partial computation (docs, issue). When you use the pooling/embedding API, vLLM always recomputes the entire input sequence.

Would you like more detail on why prefix caching is not supported for embeddings?

@RunLLM more details on how vLLM handles dynamic input token lengths for embedding models

vLLM handles dynamic input token lengths for embedding (pooling) models by supporting variable-length sequences within each batch—there is no requirement for all inputs to be padded to the same length. Internally, vLLM tracks the true length of each input and uses attention masks or sequence length metadata to ensure that only valid (non-padding) tokens are included in the embedding calculation (docs, PR discussion).

This design allows efficient batching and avoids unnecessary computation on padding tokens. The pooling operation (e.g., MEAN, CLS, LAST) is applied only to the actual tokens, and chunked processing is used for very long inputs, further supporting flexible input lengths without loss of accuracy.
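A plain-Python sketch of padding-free batching in the spirit described above: sequences are concatenated into one flat token stream and pooled per sequence via their true lengths. The function name and layout are invented for illustration and are not vLLM's actual packed representation.

```python
# Pool a packed (concatenated, padding-free) batch of hidden states.
# seq_lens gives each sequence's true length; mode selects the pooling type.

def pool_packed(flat_hidden, seq_lens, mode="MEAN"):
    pooled, start = [], 0
    for n in seq_lens:
        seq = flat_hidden[start:start + n]   # this sequence's token rows
        if mode == "CLS":
            pooled.append(seq[0])            # first token only
        elif mode == "LAST":
            pooled.append(seq[-1])           # last token only
        else:                                # MEAN over the true length
            dim = len(seq[0])
            pooled.append([sum(t[d] for t in seq) / n for d in range(dim)])
        start += n
    return pooled

# Three sequences of lengths 2, 1, 3 packed back to back -- no padding anywhere.
flat = [[1.0], [3.0], [10.0], [2.0], [4.0], [6.0]]
print(pool_packed(flat, [2, 1, 3], "MEAN"))   # [[2.0], [10.0], [4.0]]
print(pool_packed(flat, [2, 1, 3], "LAST"))   # [[3.0], [10.0], [6.0]]
```

Because the batch is packed, no compute is spent on padding tokens at all, which is the efficiency win the answer above describes.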

@RunLLM will chunked input affect accuracy?

Chunked input can slightly affect embedding accuracy, especially for models not designed for long contexts or when using pooling types other than MEAN. However, benchmarks show that for many models (e.g., multilingual-e5-large, jina-embeddings-v3), chunked processing maintains comparable performance to native long-context support as long as the context extension is not extreme (PR discussion).
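One way to see why MEAN pooling tolerates chunking well: aggregating chunk means weighted by chunk length reproduces the full-sequence mean exactly, so any accuracy change comes from the model's hidden states differing across chunk boundaries, not from the pooling arithmetic. A small worked check:

```python
# Worked check: length-weighted aggregation of chunk means equals the
# full mean. Values chosen so the arithmetic is exact in floating point.

def mean(xs):
    return sum(xs) / len(xs)

tokens = [1.0, 2.0, 3.0, 3.0, 4.0, 7.0, 1.0]
chunks = [tokens[:3], tokens[3:5], tokens[5:]]   # chunk lengths 3, 2, 2

full = mean(tokens)
weighted = sum(len(c) * mean(c) for c in chunks) / len(tokens)
print(full, weighted)       # 3.0 3.0
print(full == weighted)     # True: chunked MEAN aggregation is exact
```

CLS and LAST pooling have no such exact decomposition across chunks, which is one reason the answer above singles out non-MEAN pooling types as more sensitive to chunking.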

Would you like to see benchmark results or more technical details?

@RunLLM without chunked processing, does vLLM need padding for variable-length input?

No, vLLM does not require padding for variable-length input when chunked processing is not used. vLLM natively supports batching of variable-length sequences for embedding (pooling) models, and each input is processed according to its true length without explicit padding (docs, PR discussion).

Would you like more detail on how vLLM batches variable-length inputs?

@RunLLM how to return input_id in the OpenAI response?

It seems likely that vLLM’s OpenAI-compatible API does not natively return input_id in the response for embedding or completion endpoints. The standard OpenAI API response does not include an input_id field, and there is no documented vLLM extension for this (OpenAI API reference, vLLM docs).

If you need to associate responses with input IDs, you may need to track them client-side or use the order of returned results. Would you like suggestions for workarounds?
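A sketch of the client-side workaround: OpenAI-style embeddings responses include an `index` field on each item matching input order, so you can join your own IDs to the results. `fake_response` below stands in for the `.data` of a real client call; `attach_ids` is a hypothetical helper, not part of any library.

```python
# Join caller-supplied IDs to embeddings using the response 'index' field,
# which is robust even if the response items arrive out of order.

def attach_ids(ids, response_data):
    """ids: caller IDs in the order inputs were sent; response_data: list of
    dicts with 'index' and 'embedding' keys, as in an OpenAI-style response."""
    by_index = {item["index"]: item["embedding"] for item in response_data}
    return {id_: by_index[i] for i, id_ in enumerate(ids)}

ids = ["doc-a", "doc-b"]
fake_response = [                  # stand-in for client.embeddings.create(...).data
    {"index": 1, "embedding": [0.3, 0.4]},
    {"index": 0, "embedding": [0.1, 0.2]},
]
print(attach_ids(ids, fake_response))
# {'doc-a': [0.1, 0.2], 'doc-b': [0.3, 0.4]}
```

Keying on `index` rather than list position protects against any reordering between request and response.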
