Are there plans to support OpenAI Realtime-API or other realtime API protocols?

As multimodal models develop, streaming input capability (something like the Realtime API) is becoming a feature that inference engines need to implement.

On the protocol side, the server should no longer be limited to HTTP; it should consider streaming protocols such as WebSockets and gRPC for receiving client requests.
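To make the kind of interface this implies concrete, here is a minimal WebSocket ingress sketch in Python using the third-party `websockets` library. The port, event names, and `fake_engine_stream` helper are hypothetical stand-ins loosely modeled on the Realtime-API event style, not an existing vLLM or OpenAI endpoint.

```python
# Hypothetical sketch of a WebSocket ingress for streaming multimodal input.
# The message schema and engine hook are illustrative assumptions, not a real API.
import asyncio
import json

import websockets  # third-party: pip install websockets


async def fake_engine_stream(audio: bytes):
    """Stand-in for an inference engine's incremental decode loop."""
    for token in ["hello", " ", "world"]:
        await asyncio.sleep(0.01)  # pretend decode latency
        yield token


async def handle_session(ws):
    """Accept interleaved JSON control messages and binary media chunks."""
    audio_buffer = bytearray()
    async for message in ws:
        if isinstance(message, bytes):
            # Binary frames carry raw audio (or video) chunks.
            audio_buffer.extend(message)
        else:
            event = json.loads(message)
            if event.get("type") == "input_audio.commit":
                # Hand the accumulated chunk to the (hypothetical) engine and
                # stream partial results back as they are produced.
                async for token in fake_engine_stream(bytes(audio_buffer)):
                    await ws.send(json.dumps({"type": "response.delta", "delta": token}))
                await ws.send(json.dumps({"type": "response.done"}))
                audio_buffer.clear()


async def main():
    async with websockets.serve(handle_session, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```

The key difference from an HTTP request/response loop is that input chunks and output tokens share one long-lived connection, so the client can keep appending media while earlier results are still streaming back.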

Although this may require changes to the model structure, I think it should be a key feature of the inference engine for supporting multimodal usage scenarios.

You are correct: as multimodal models become more prevalent, streaming input capability is increasingly important for inference engines. The vLLM community is actively discussing and prototyping support for streaming protocols beyond HTTP, such as gRPC and ZeroMQ, to enable real-time, high-throughput serving with overlapped CPU/GPU work for both text and multimodal inputs. Early experiments show that protocols like ZeroMQ can significantly improve performance in high-QPS scenarios, and there is ongoing work to optimize and harden these streaming interfaces for production use, including for multimodal models and future real-time APIs. However, full support for streaming multimodal input (e.g., video/audio chunking) and protocol-agnostic serving is still under development and not yet generally available in vLLM. See the detailed discussions and benchmarks in this RFC and related issues.
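To make the frontend/engine decoupling idea concrete, below is a minimal ZeroMQ sketch using `pyzmq`. The socket path, message schema, and the stand-in decode loop are assumptions for illustration only; this is not vLLM's actual internal wire protocol.

```python
# Illustrative sketch (not vLLM's real protocol) of decoupling a frontend from
# the engine with ZeroMQ so request ingestion overlaps with GPU-side work.
import json

import zmq  # third-party: pip install pyzmq

ENGINE_ENDPOINT = "ipc:///tmp/engine.sock"  # hypothetical socket path


def frontend_submit(prompt: str) -> str:
    """Frontend side: send a request and collect the streamed-back chunks."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.DEALER)
    sock.connect(ENGINE_ENDPOINT)
    sock.send_json({"prompt": prompt, "max_tokens": 32})
    chunks = []
    while True:
        msg = sock.recv_json()  # each message carries one decoded chunk
        if msg.get("finished"):
            break
        chunks.append(msg["text"])
    sock.close()
    return "".join(chunks)


def engine_loop():
    """Engine side: receive requests, stream partial outputs back per client."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.ROUTER)
    sock.bind(ENGINE_ENDPOINT)
    while True:
        identity, payload = sock.recv_multipart()
        request = json.loads(payload)  # noqa: F841  (prompt unused in this sketch)
        for piece in ("partial ", "output"):  # stand-in for decode steps
            sock.send_multipart([identity, json.dumps({"text": piece}).encode()])
        sock.send_multipart([identity, json.dumps({"finished": True}).encode()])
```

The DEALER/ROUTER pair lets many frontend connections multiplex onto one engine process, which is the same property that makes ZeroMQ attractive in the high-QPS experiments mentioned above.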

Would you like more detail on the current implementation status or example usage?

Sources: