Does vLLM come with its own load balancer? What's the recommended way of load-balancing an inference endpoint with vLLM?
The load balancers currently live outside the vLLM repo; they are components layered on top of it. The available options AFAIK are:
- vLLM Production Stack ([vllm-project/production-stack](https://github.com/vllm-project/production-stack)): vLLM's reference system for K8s-native cluster-wide deployment with community-driven performance optimization
- AIBrix ([vllm-project/aibrix](https://github.com/vllm-project/aibrix)): cost-efficient and pluggable infrastructure components for GenAI inference
- Ray Serve: see the "Serving LLMs" guide in the Ray docs (and the sketch below)
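Of these, Ray Serve is the easiest to show in a few lines. Here is a minimal, untested sketch using the `ray.serve.llm` API from recent Ray releases; the model ID, replica bounds, and engine kwargs are placeholders, not recommendations. Ray Serve spins up multiple vLLM replicas and routes incoming requests across them for you:

```python
# Minimal sketch, assuming a recent ray[serve,llm] release with vLLM
# available as the engine. All names and numbers below are placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llm",                          # name exposed to clients (placeholder)
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # model to load (placeholder)
    ),
    deployment_config=dict(
        # Serve autoscales the vLLM replicas and load-balances across them.
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
    ),
    engine_kwargs=dict(tensor_parallel_size=1),  # forwarded to the vLLM engine
)

# Build an OpenAI-compatible app and start serving; requests to /v1/...
# are spread across the running replicas.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

Clients then hit the OpenAI-compatible endpoint (e.g. `http://localhost:8000/v1/chat/completions`) with `model="my-llm"`, and Serve distributes the requests over the replicas.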