A local vLLM orchestrator (CLI & Web UI) for VRAM pre-calculation and CPU deployments

Hi everyone,

Like most people here, I absolutely love vLLM for its incredible inference speed. However, as I started deploying it locally, my biggest operational headache was hardware management.

Guessing the exact gpu_memory_utilization or manually calculating the KV cache footprint often led to Out-Of-Memory (OOM) crashes. On the flip side, sometimes I just wanted to bypass the GPU entirely and run a smaller model on my system CPU/RAM, but managing the different Docker runtimes and flags was getting tedious.

I wanted a more automated, “push-button” dev experience for both GPU and CPU workloads, so I built vAquila.

It’s an open-source Python orchestrator wrapper for vLLM, and it’s also my very first AI infrastructure project! I am releasing the Beta version today.

How it works under the hood:

  • Mathematical VRAM Pre-checks: Before launching a GPU engine, vAquila reads the model’s config.json to estimate the required VRAM (Base Weights + Estimated KV Cache + CUDA overhead) and uses pynvml to query the GPU’s available memory. It only spins up the container if the math says the model will fit.

  • First-Class CPU Support: Don’t have enough VRAM? You can explicitly target the CPU from the UI or CLI. It will allocate the model to your system RAM and spin up the appropriate CPU-optimized vLLM container.

  • Containerized Execution: It automatically manages the vLLM Docker containers via Docker’s Python SDK. You never have to write long docker run commands again.

  • Real-time Observability: It includes a Typer-based CLI (vaq) and a local FastAPI Web UI (Dark mode included) to monitor your GPU, system RAM, and CPU thread metrics live while vLLM is serving requests.
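
To make the pre-check concrete, here’s a minimal sketch of the arithmetic (function and parameter names are illustrative, not vAquila’s actual API; the flat overhead default is an assumption):

```python
# Illustrative VRAM pre-check: required = weights + KV cache + flat overhead.
# The 2x on the KV cache accounts for storing both keys and values.
def estimate_required_vram_gib(
    num_params_b: float,        # parameters, in billions
    weight_bytes: int,          # bytes per weight (2 for fp16/bf16)
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    kv_bytes: int,              # bytes per cached KV element
    max_model_len: int,
    max_num_seqs: int,
    overhead_gib: float = 1.5,  # assumed flat CUDA context / non-torch buffer
) -> float:
    weights = num_params_b * 1e9 * weight_bytes
    kv_cache = (2 * num_layers * num_kv_heads * head_dim
                * kv_bytes * max_model_len * max_num_seqs)
    return (weights + kv_cache) / 2**30 + overhead_gib
```

The “available” side of the comparison can come from pynvml’s nvmlDeviceGetMemoryInfo(handle).free.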

Why I’m posting here: I genuinely hope this tool can be a useful addition to the vLLM ecosystem and help others who might be struggling with the same deployment hurdles.

Since this community knows vLLM better than anyone, I would be incredibly grateful for your technical feedback on this Beta. Specifically, I would love your thoughts on my VRAM estimation logic for GPUs: Is calculating the KV cache mathematically (Layers * KV Heads * Head Dim * Bytes * Max Tokens * Max Seqs) robust enough for production, or are there edge cases in vLLM’s memory allocation that I should account for?
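
As a quick sanity check of that formula (note that keys and values are both cached, so a factor of 2 usually belongs in front), here is the arithmetic with Llama-3-8B-style config values, assumed here purely for illustration:

```python
# KV-cache size for the formula above, with a 2x for the K and V planes.
layers, kv_heads, head_dim = 32, 8, 128   # Llama-3-8B-style config values
bytes_per_elem = 2                        # fp16/bf16 cache
max_tokens, max_seqs = 8192, 16

kv_cache_bytes = (2 * layers * kv_heads * head_dim
                  * bytes_per_elem * max_tokens * max_seqs)
print(f"{kv_cache_bytes / 2**30:.1f} GiB")  # -> 16.0 GiB
```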

Project link: vAquila | Local AI Orchestration

Thanks for your time, and I’m looking forward to your feedback to help improve the tool!

Your approach—mathematically estimating VRAM needs using model config (Layers × KV Heads × Head Dim × Bytes × Max Tokens × Max Seqs) plus weights and CUDA overhead—is broadly correct and aligns with how vLLM internally profiles memory for the KV cache and weights. However, vLLM’s actual allocation also includes activation memory, non-torch allocations, and CUDA graph buffers, which can cause real-world usage to exceed pure math estimates, especially with features like CUDA graphs or multi-modal models. For production robustness, it’s recommended to add a buffer (e.g., 150–500 MiB) and, if possible, profile with dummy runs as vLLM does internally (see code).

For CPU workloads, vLLM supports explicit KV cache sizing via the VLLM_CPU_KVCACHE_SPACE environment variable, and your orchestration logic should ensure this is set based on available RAM (see docs). Your tool’s container orchestration and observability features are a strong addition to the ecosystem, and your VRAM estimation logic is sound for most cases, but consider edge cases like CUDA graph memory and multi-instance deployments for maximum reliability.
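
For example, the RAM-sizing side could be wired up like this (the sizing heuristic and function name are mine, not from vLLM’s docs; VLLM_CPU_KVCACHE_SPACE itself is real and takes a size in GiB):

```python
# Sketch: derive VLLM_CPU_KVCACHE_SPACE (in GiB) from what is left after
# the model weights and an OS/runtime headroom. Heuristic values assumed.
def cpu_kvcache_space_gib(total_ram_gib: float, model_gib: float,
                          headroom_gib: float = 4.0) -> int:
    return max(1, int(total_ram_gib - model_gib - headroom_gib))

# e.g. pass as `environment=` to the Docker SDK's containers.run(...)
env = {"VLLM_CPU_KVCACHE_SPACE": str(cpu_kvcache_space_gib(64, 16))}
```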

Would you like a step-by-step breakdown of vLLM’s memory profiling and edge cases?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Thank you for the detailed validation and the fantastic insights! It is incredibly encouraging to hear that the core math aligns with vLLM’s internal profiling.

Yes, I would absolutely love a step-by-step breakdown of vLLM’s memory profiling and edge cases. I am particularly interested in understanding how the CUDA graph buffers and activation memory scale: is it a flat rate, or does it scale linearly with the max tokens/batch size?

Right now, vAquila adds a flat ~1.5GB to 2GB overhead to the calculation to absorb the PyTorch context and these non-torch allocations. If I can replace this flat rate with the actual formula for CUDA graphs, the tool will be perfectly precise.

Also, thank you for the tip regarding the VLLM_CPU_KVCACHE_SPACE environment variable! I will explicitly integrate this into the Docker runtime arguments for CPU deployments in the next Beta update to make RAM allocation much safer.

Looking forward to the breakdown!

Here’s a step-by-step breakdown of vLLM’s memory profiling:

  1. Model Weights: vLLM loads model weights first and measures their memory usage directly (see code). This is a fixed cost based on model size and dtype.
  2. Activation Memory: During a profiling run, vLLM runs a dummy forward pass with the configured max_model_len and max_num_seqs to empirically measure peak activation memory. This activation memory scales with both batch size and sequence length, but the scaling is not strictly linear due to kernel and backend differences (see docs).
  3. CUDA Graph Buffers: If CUDA graphs are enabled (the default), vLLM captures graphs for various batch sizes. The memory for CUDA graphs is reserved up front and can be significant—often hundreds of MiB to several GiB depending on the number of capture sizes and model architecture. The scaling is stepwise: each captured graph for a batch size reserves a buffer, so more capture sizes or larger batch sizes increase total reserved memory (see code).
  4. Non-Torch Allocations: vLLM also accounts for memory used by NCCL, CUDA context, and other non-torch allocations, which are measured empirically during profiling.
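
The accounting in steps 1–4 boils down to a subtraction; a simplified mirror of it is sketched below (the starred inputs are measured empirically by vLLM, not computed — the values in the test are placeholders):

```python
# Simplified mirror of vLLM's memory-profiling arithmetic: whatever remains
# of the utilization budget after weights, activations, and non-torch
# allocations becomes the KV-cache budget.
def kv_cache_budget_gib(
    total_vram_gib: float,
    gpu_memory_utilization: float,   # vLLM's --gpu-memory-utilization
    weights_gib: float,              # measured after weight loading
    peak_activation_gib: float,      # * measured during the dummy forward pass
    non_torch_gib: float,            # * CUDA context, NCCL, etc.
) -> float:
    usable = total_vram_gib * gpu_memory_utilization
    return usable - weights_gib - peak_activation_gib - non_torch_gib
```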

Key edge cases:

  • CUDA graph memory is not a flat rate; it increases with the number and size of captured graphs (i.e., more batch sizes or larger max batch size = more memory).
  • Activation memory can spike for large batch sizes or long sequences, especially if logprobs or multi-modal features are enabled.
  • Fragmentation and non-torch allocations can cause actual usage to exceed estimates, so a buffer is still recommended.
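
For the first edge case, the stepwise growth can be modeled roughly like this (the per-graph costs are made-up placeholders; real numbers vary by model and backend and must be measured):

```python
# Illustrative stepwise model of CUDA graph reservations: one buffer per
# captured batch size, with a cost that grows with the size captured.
def cuda_graph_reserved_mib(capture_sizes, base_mib=5.0, per_seq_mib=1.0):
    return sum(base_mib + per_seq_mib * b for b in capture_sizes)

# more capture sizes (or larger ones) => more reserved memory
reserved = cuda_graph_reserved_mib([1, 2, 4, 8])
```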

Would you like a code example or more detail on how to empirically measure these values?
