Are there plans to support OpenAI Realtime-API or other realtime API protocols?

As multimodal models develop, streaming input capability (something like the Realtime API) is becoming a feature that inference engines need to implement.

On the protocol side, the server should no longer be limited to HTTP; it should consider streaming protocols such as WebSockets and gRPC for receiving client requests.
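To make the kind of interface this implies concrete, here is a minimal WebSocket ingress sketch in Python using the third-party `websockets` library. The port, event names, and `fake_engine_stream` helper are hypothetical stand-ins loosely modeled on the Realtime-API event style, not an existing vLLM or OpenAI endpoint.

```python
# Hypothetical sketch of a WebSocket ingress for streaming multimodal input.
# The message schema and engine hook are illustrative assumptions, not a real API.
import asyncio
import json

import websockets  # third-party: pip install websockets


async def fake_engine_stream(audio: bytes):
    """Stand-in for an inference engine's incremental decode loop."""
    for token in ["hello", " ", "world"]:
        await asyncio.sleep(0.01)  # pretend decode latency
        yield token


async def handle_session(ws):
    """Accept interleaved JSON control messages and binary media chunks."""
    audio_buffer = bytearray()
    async for message in ws:
        if isinstance(message, bytes):
            # Binary frames carry raw audio (or video) chunks.
            audio_buffer.extend(message)
        else:
            event = json.loads(message)
            if event.get("type") == "input_audio.commit":
                # Hand the accumulated chunk to the (hypothetical) engine and
                # stream partial results back as they are produced.
                async for token in fake_engine_stream(bytes(audio_buffer)):
                    await ws.send(json.dumps({"type": "response.delta", "delta": token}))
                await ws.send(json.dumps({"type": "response.done"}))
                audio_buffer.clear()


async def main():
    async with websockets.serve(handle_session, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```

The key difference from an HTTP request/response loop is that input chunks and output tokens share one long-lived connection, so the client can keep appending media while earlier results are still streaming back.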

Although this may require changes to the model structure, I think it should be a key feature of the inference engine for supporting multimodal usage scenarios.

You are correct: as multimodal models become more prevalent, streaming input capability is increasingly important for inference engines. The vLLM community is actively discussing and prototyping support for streaming protocols beyond HTTP, such as gRPC and ZeroMQ, to enable real-time, high-throughput serving with overlapped CPU/GPU work for both text and multimodal inputs. Early experiments show that protocols like ZeroMQ can significantly improve performance in high-QPS scenarios, and there is ongoing work to optimize and harden these streaming interfaces for production use, including for multimodal models and future real-time APIs. However, full support for streaming multimodal input (e.g., video/audio chunking) and protocol-agnostic serving is still under development and not yet generally available in vLLM. See the detailed discussions and benchmarks in this RFC and related issues.
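To make the frontend/engine decoupling idea concrete, below is a minimal ZeroMQ sketch using `pyzmq`. The socket path, message schema, and the stand-in decode loop are assumptions for illustration only; this is not vLLM's actual internal wire protocol.

```python
# Illustrative sketch (not vLLM's real protocol) of decoupling a frontend from
# the engine with ZeroMQ so request ingestion overlaps with GPU-side work.
import json

import zmq  # third-party: pip install pyzmq

ENGINE_ENDPOINT = "ipc:///tmp/engine.sock"  # hypothetical socket path


def frontend_submit(prompt: str) -> str:
    """Frontend side: send a request and collect the streamed-back chunks."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.DEALER)
    sock.connect(ENGINE_ENDPOINT)
    sock.send_json({"prompt": prompt, "max_tokens": 32})
    chunks = []
    while True:
        msg = sock.recv_json()  # each message carries one decoded chunk
        if msg.get("finished"):
            break
        chunks.append(msg["text"])
    sock.close()
    return "".join(chunks)


def engine_loop():
    """Engine side: receive requests, stream partial outputs back per client."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.ROUTER)
    sock.bind(ENGINE_ENDPOINT)
    while True:
        identity, payload = sock.recv_multipart()
        request = json.loads(payload)  # noqa: F841  (prompt unused in this sketch)
        for piece in ("partial ", "output"):  # stand-in for decode steps
            sock.send_multipart([identity, json.dumps({"text": piece}).encode()])
        sock.send_multipart([identity, json.dumps({"finished": True}).encode()])
```

The DEALER/ROUTER pair lets many frontend connections multiplex onto one engine process, which is the same property that makes ZeroMQ attractive in the high-QPS experiments mentioned above.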

Would you like more detail on the current implementation status or example usage?

Sources: