Hosting LLMs using vLLM in the cloud

People who have hosted LLMs using vLLM, what approach did you guys take?
Listing some approaches I am considering. I would like to understand the complexity involved, ease of scaling to more models, higher production loads, etc.

  1. EC2 (considering g5.xlarge) with an ASG
  2. Using k8s
  3. Using frameworks like Anyscale, AnythingLLM, AutoGen, BentoML, etc.
  4. Using integrations like KubeAI, KubeRay, etc.
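
For context, all of these options would typically front vLLM's OpenAI-compatible server, so the client side stays the same regardless of the deployment. Below is a minimal sketch, assuming a server already running behind some endpoint; the base URL and model name are placeholders for whatever the deployment actually serves:

```python
# Minimal client sketch against a vLLM OpenAI-compatible endpoint.
# pip install openai -- the stock OpenAI client works because vLLM speaks the same API.
from openai import OpenAI

# Placeholder endpoint: an EC2 instance, a K8s Service, or a framework-managed URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    # Placeholder model name; must match what the server was launched with.
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```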

The frameworks and integrations are taken from the vLLM docs under Deployment. I am not fully aware of what exactly they solve for, but I would like to understand if any of you have used those tools.

For production vLLM deployments, users commonly choose between:

  1. Direct EC2 (e.g., g5.xlarge) with manual autoscaling
  2. Kubernetes (K8s), with or without Helm
  3. Managed or integrated frameworks like Anyscale, KubeAI, KubeRay, KServe, llmaz, or the vLLM production stack

Kubernetes-based approaches (with Helm or the production stack) are favored for scalability, multi-model support, and observability, while integrations like KubeAI and KubeRay simplify autoscaling, model management, and integration with existing K8s workflows. Anyscale offers managed Ray clusters for rapid, production-ready endpoints without K8s overhead.

Each method varies in operational complexity and ease of scaling: K8s and its integrations/frameworks generally offer the most flexibility and production features, but require more initial setup and K8s knowledge. Direct EC2 is simpler but less scalable and harder to manage for multiple models or high loads. See the vLLM Kubernetes guide, the production stack, KubeAI, KubeRay, and Anyscale docs for details.
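
To give a sense of what the Ray-based options (KubeRay, Anyscale) look like in practice, here is a minimal sketch of a Ray Serve deployment wrapping vLLM's offline LLM class. The model name, replica count, and GPU settings are placeholders, and a production setup would more likely use vLLM's async engine or its OpenAI-compatible server rather than this blocking call:

```python
# Minimal Ray Serve + vLLM sketch (assumes `pip install "ray[serve]" vllm` on a GPU node).
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMServer:
    def __init__(self, model_name: str):
        # Each replica loads the model onto its assigned GPU.
        self.llm = LLM(model=model_name)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        # Blocking generate call; fine for a sketch, not for high-throughput serving.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


# Placeholder model; swap in whatever you actually serve.
app = VLLMServer.bind("Qwen/Qwen2.5-0.5B-Instruct")
# Run locally with:  serve run my_vllm_app:app
# On KubeRay, the same application is referenced from a RayService manifest.
```

The appeal of KubeRay or Anyscale is that you hand them an application like this and they handle replica scheduling, autoscaling, and upgrades, instead of you wiring up Deployments and autoscaling policies by hand.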

