Hello everyone! I’m using vLLM for chat models, reranking, embedding, and speech-to-text. The one use case still missing for me is support for text-to-speech models like Piper and Kokoro. Could we consider adding support for these types of models in vLLM, including an OpenAI-compatible /v1/audio/speech endpoint, so they could be easily integrated into litellm, for example?
vLLM currently does not support text-to-speech (TTS) models such as Piper or Kokoro, nor does it expose an OpenAI-compatible /v1/audio/speech endpoint. The project has focused on LLM inference, embeddings, reranking, and speech-to-text (ASR) via models like Whisper and Granite Speech, and there is nothing in the official documentation or issue tracker indicating that TTS support is available or planned.

Architecturally, vLLM's engine is built around autoregressive token generation, while TTS models typically need a different processing pipeline: Piper, for instance, is built on VITS (a VAE-based architecture with a neural vocoder), and other systems rely on diffusion or dedicated audio decoders that vLLM's backend does not support. Requests for similar multimodal or audio-generation models (e.g., VibeVoice) have been noted, but the maintainers have stated there are no plans to support diffusion-based or TTS models at this time, though contributions are welcome if you wish to propose an implementation yourself.
If you need TTS behind an OpenAI-compatible API today, the practical route is to run a dedicated TTS server or library and put a thin wrapper in front of it that mimics the OpenAI /v1/audio/speech endpoint; a sketch of that approach is below. For now, vLLM is best suited for LLM inference, embeddings, reranking, and ASR, not TTS generation or serving. Would you like more detail on the technical reasons, or on possible workarounds for integrating TTS alongside a vLLM-based stack?
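For reference, here is a minimal sketch of such a wrapper, assuming FastAPI for the HTTP layer and a hypothetical `synthesize(text, voice)` helper standing in for whichever TTS engine you actually run (Piper, Kokoro, etc.). The request fields follow the OpenAI /v1/audio/speech conventions (`model`, `input`, `voice`, `response_format`), and the endpoint returns raw audio bytes:

```python
# Minimal OpenAI-compatible /v1/audio/speech shim (sketch, not production code).
# `synthesize()` is a placeholder for whichever TTS engine you use (Piper, Kokoro, ...);
# here it is assumed to return encoded audio bytes for the given text and voice.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()


class SpeechRequest(BaseModel):
    model: str                    # e.g. "kokoro"; map it to a local engine or ignore it
    input: str                    # text to synthesize
    voice: str = "default"        # voice/speaker id understood by your TTS backend
    response_format: str = "wav"  # OpenAI also accepts mp3, opus, aac, flac, pcm


def synthesize(text: str, voice: str) -> bytes:
    """Placeholder: call your TTS engine here and return encoded audio bytes."""
    raise NotImplementedError("wire this up to Piper/Kokoro/etc.")


@app.post("/v1/audio/speech")
def create_speech(req: SpeechRequest) -> Response:
    audio = synthesize(req.input, req.voice)
    media_type = "audio/wav" if req.response_format == "wav" else "application/octet-stream"
    # The OpenAI endpoint returns raw audio bytes; a plain Response is enough for most clients.
    return Response(content=audio, media_type=media_type)
```

Clients that already speak the OpenAI API (the openai Python SDK, or litellm pointed at a custom api_base) can then call this server with something like `client.audio.speech.create(model="kokoro", voice="default", input="hello")`, while vLLM keeps serving the rest of your stack unchanged.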