Hello everyone! I’m using vLLM for chat models, reranking, embedding, and speech-to-text. The one use case still missing for me is support for text-to-speech models like Piper and Kokoro. Could we consider adding support for these types of models in vLLM, including an OpenAI-compatible /v1/audio/speech endpoint, so they could be easily integrated into litellm, for example?
vLLM currently does not support text-to-speech (TTS) models such as Piper or Kokoro, nor does it expose an OpenAI-compatible /v1/audio/speech endpoint. The project has focused on LLM inference, embeddings, reranking, and speech-to-text (ASR) via models like Whisper and Granite Speech, and there is nothing in the official documentation or issue tracker indicating that TTS support is available or planned.

Architecturally, vLLM's engine is built around autoregressive token generation, while TTS models typically need a different processing pipeline: Piper, for instance, is built on VITS (a VAE-based architecture with a neural vocoder), and other systems rely on diffusion or dedicated audio decoders that vLLM's backend does not support. Requests for similar multimodal or audio-generation models (e.g., VibeVoice) have been noted, but the maintainers have stated there are no plans to support diffusion-based or TTS models at this time, though contributions are welcome if you wish to propose an implementation yourself.
If you need TTS behind an OpenAI-compatible API today, the practical route is to run a dedicated TTS server or library and put a thin wrapper in front of it that mimics the OpenAI /v1/audio/speech endpoint; a sketch of that approach is below. For now, vLLM is best suited for LLM inference, embeddings, reranking, and ASR, not TTS generation or serving. Would you like more detail on the technical reasons, or on possible workarounds for integrating TTS alongside a vLLM-based stack?
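For reference, here is a minimal sketch of such a wrapper, assuming FastAPI for the HTTP layer and a hypothetical `synthesize(text, voice)` helper standing in for whichever TTS engine you actually run (Piper, Kokoro, etc.). The request fields follow the OpenAI /v1/audio/speech conventions (`model`, `input`, `voice`, `response_format`), and the endpoint returns raw audio bytes:

```python
# Minimal OpenAI-compatible /v1/audio/speech shim (sketch, not production code).
# `synthesize()` is a placeholder for whichever TTS engine you use (Piper, Kokoro, ...);
# here it is assumed to return encoded audio bytes for the given text and voice.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()


class SpeechRequest(BaseModel):
    model: str                    # e.g. "kokoro"; map it to a local engine or ignore it
    input: str                    # text to synthesize
    voice: str = "default"        # voice/speaker id understood by your TTS backend
    response_format: str = "wav"  # OpenAI also accepts mp3, opus, aac, flac, pcm


def synthesize(text: str, voice: str) -> bytes:
    """Placeholder: call your TTS engine here and return encoded audio bytes."""
    raise NotImplementedError("wire this up to Piper/Kokoro/etc.")


@app.post("/v1/audio/speech")
def create_speech(req: SpeechRequest) -> Response:
    audio = synthesize(req.input, req.voice)
    media_type = "audio/wav" if req.response_format == "wav" else "application/octet-stream"
    # The OpenAI endpoint returns raw audio bytes; a plain Response is enough for most clients.
    return Response(content=audio, media_type=media_type)
```

Clients that already speak the OpenAI API (the openai Python SDK, or litellm pointed at a custom api_base) can then call this server with something like `client.audio.speech.create(model="kokoro", voice="default", input="hello")`, while vLLM keeps serving the rest of your stack unchanged.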