vLLM throughput dropping when running concurrent background executors?

Hi everyone,
I’ve been running some benchmarks on vLLM (v0.4.x) using an L40S setup to serve a Llama-3 70B model, and I’ve run into a strange performance degradation issue that doesn’t seem to be related to the KV cache size or request rate. Generally, my tokens-per-second (TPS) throughput is very stable, but I’ve noticed a significant spike in P99 latency whenever my local workstation or testing node is under a high background script load.

I’ve been testing some blox scripts in a separate container on the same host to simulate a high-frequency automation environment while I run my inference requests. It seems like as soon as those background executors start ramping up their CPU and memory usage, vLLM’s continuous batching starts to stutter, and I see “Server is overloaded” warnings even though my VRAM utilization is well within limits.

Has anyone else noticed vLLM’s scheduler becoming sensitive to other third-party executors or intensive script environments running on the same OS? I’m trying to figure out if I should be pinning the vLLM process to specific CPU cores or if there is a known issue with how the Ray actor handles resource sharing when the system-wide thread count gets too high.

Is there a way to give the vLLM engine higher priority than other local scripts to prevent these latency spikes, or should I be looking into more aggressive memory locking to ensure the model weights don’t get paged out during these background bursts? I’d really appreciate any tips on maintaining high throughput while multitasking with other high-load tools on the same hardware.

Yes, vLLM’s performance—especially latency and throughput—can degrade significantly if CPU resources are underprovisioned or heavily contended by other processes, even if GPU/VRAM utilization is low. The vLLM engine core, input processing (tokenization), and output streaming all run on CPU, and contention here directly impacts scheduling latency and batching efficiency, leading to “Server is overloaded” warnings and higher P99 latency, as you’ve observed. Pinning vLLM processes to dedicated CPU cores and increasing their priority can help mitigate these issues; it’s recommended to allocate at least 2 + N physical CPU cores (where N = number of GPUs) exclusively for vLLM, and avoid running other CPU-intensive workloads on the same cores or host if possible.
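If it helps, here is a minimal sketch (not an official vLLM recipe) of one way to pin the server process to a reserved core set and raise its priority before the engine starts, on Linux. The core IDs, model name, and `--tensor-parallel-size` value are placeholders for your setup, and negative nice values usually require root or CAP_SYS_NICE:

```python
# Minimal sketch: reserve a core set for vLLM and launch the OpenAI-compatible
# server with that affinity. Core IDs and model/TP settings are placeholders.
import os

DEDICATED_CORES = {0, 1, 2, 3, 4, 5}  # assumption: 2 + N physical cores, N = GPU count

# Restrict this process (and the children it forks) to the reserved cores.
os.sched_setaffinity(0, DEDICATED_CORES)

# Optionally raise scheduling priority; a negative nice value needs privileges.
try:
    os.nice(-5)
except PermissionError:
    print("Could not raise priority; run with CAP_SYS_NICE or as root if needed.")

# Replace the current process with the vLLM server so affinity/priority carry over.
os.execvp("python", [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-70B-Instruct",
    "--tensor-parallel-size", "4",
])
```

The same effect can be had from the shell with something like `taskset -c 0-5 nice -n -5 python -m vllm.entrypoints.openai.api_server ...` if you prefer not to wrap the launch in Python.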

You can also use OS-level tools (e.g., taskset, nice, cgroups) to isolate vLLM from background scripts, and consider memory locking (mlock) if you suspect paging, though CPU contention is usually the main culprit. For Ray-based deployments, ensure Ray actors are not oversubscribed and that vLLM’s resource requests are honored. For more details, see vLLM Optimization: Performance Impact and Minimum CPU Requirements. Would you like a step-by-step guide on how to set CPU affinity and process priority for vLLM?
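To check whether the isolation actually holds during a background burst, a small watcher along these lines can help. This is a sketch that assumes the `psutil` package is installed and that your server command line contains `api_server`; adjust the match to however you launch vLLM:

```python
# Sketch: periodically report the affinity, nice value, and CPU usage of the
# vLLM server processes so you can correlate P99 spikes with CPU contention.
import time
import psutil

def find_vllm_procs():
    procs = []
    for p in psutil.process_iter(["pid", "cmdline"]):
        cmdline = p.info["cmdline"] or []
        if any("api_server" in arg for arg in cmdline):  # assumption: launch command
            procs.append(p)
    return procs

while True:
    for proc in find_vllm_procs():
        try:
            print(f"pid={proc.pid} affinity={proc.cpu_affinity()} "
                  f"nice={proc.nice()} cpu%={proc.cpu_percent(interval=None):.1f}")
        except psutil.NoSuchProcess:
            continue
    time.sleep(5)
```

If CPU usage pegs at the limit of the reserved cores, or the affinity set changes, while the background executors are running, that points to CPU-side contention rather than anything VRAM-related.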
