Hosting LLMs using vLLM in the cloud

People who have hosted LLMs using vLLM, what approach did you guys take?
Listing some approaches I am considering. I would like to understand the complexity involved, ease of scaling to more models, higher production loads, etc.

  1. EC2 (considering g5.xlarge) with an ASG
  2. Using k8s
  3. Using frameworks like Anyscale, AnythingLLM, AutoGen, BentoML, etc.
  4. Using integrations like KubeAI, KubeRay, etc.
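
For context, all of these options would typically front vLLM's OpenAI-compatible server, so the client side stays the same regardless of the deployment. Below is a minimal sketch, assuming a server already running behind some endpoint; the base URL and model name are placeholders for whatever the deployment actually serves:

```python
# Minimal client sketch against a vLLM OpenAI-compatible endpoint.
# pip install openai -- the stock OpenAI client works because vLLM speaks the same API.
from openai import OpenAI

# Placeholder endpoint: an EC2 instance, a K8s Service, or a framework-managed URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    # Placeholder model name; must match what the server was launched with.
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```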

The frameworks and integrations are taken from the vLLM docs under Deployment. I am not fully aware of what exactly they solve for, but I would like to understand if any of you have used those tools.

For production vLLM deployments, users commonly choose between:

  1. Direct EC2 (e.g., g5.xlarge) with manual autoscaling
  2. Kubernetes (K8s), with or without Helm
  3. Managed or integrated frameworks like Anyscale, KubeAI, KubeRay, KServe, llmaz, or the vLLM production stack

Kubernetes-based approaches (with Helm or the production stack) are favored for scalability, multi-model support, and observability, while integrations like KubeAI and KubeRay simplify autoscaling, model management, and integration with existing K8s workflows. Anyscale offers managed Ray clusters for rapid, production-ready endpoints without K8s overhead.

Each method varies in operational complexity and ease of scaling: K8s and its integrations/frameworks generally offer the most flexibility and production features, but require more initial setup and K8s knowledge. Direct EC2 is simpler but less scalable and harder to manage for multiple models or high loads. See the vLLM Kubernetes guide, the production stack, KubeAI, KubeRay, and Anyscale docs for details.
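
To give a sense of what the Ray-based options (KubeRay, Anyscale) look like in practice, here is a minimal sketch of a Ray Serve deployment wrapping vLLM's offline LLM class. The model name, replica count, and GPU settings are placeholders, and a production setup would more likely use vLLM's async engine or its OpenAI-compatible server rather than this blocking call:

```python
# Minimal Ray Serve + vLLM sketch (assumes `pip install "ray[serve]" vllm` on a GPU node).
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMServer:
    def __init__(self, model_name: str):
        # Each replica loads the model onto its assigned GPU.
        self.llm = LLM(model=model_name)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        # Blocking generate call; fine for a sketch, not for high-throughput serving.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


# Placeholder model; swap in whatever you actually serve.
app = VLLMServer.bind("Qwen/Qwen2.5-0.5B-Instruct")
# Run locally with:  serve run my_vllm_app:app
# On KubeRay, the same application is referenced from a RayService manifest.
```

The appeal of KubeRay or Anyscale is that you hand them an application like this and they handle replica scheduling, autoscaling, and upgrades, instead of you wiring up Deployments and autoscaling policies by hand.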

