Hi everyone,
I’ve been running some benchmarks on vLLM (v0.4.x) on an L40S setup serving a Llama-3 70B model, and I’ve hit a strange performance degradation that doesn’t seem related to KV cache size or request rate. Throughput in tokens per second (TPS) is generally very stable, but P99 latency spikes significantly whenever my workstation or testing node is under heavy background script load.
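For reference, I’m measuring latency with a harness roughly like this (simplified from my actual async client; the endpoint, port, and model id are placeholders for my setup):

```python
import statistics
import time

import requests  # pip install requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint/port
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model id
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64,
}

latencies = []
for _ in range(200):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

# Crude percentile from the sorted per-request latencies.
latencies.sort()
p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
print(f"mean={statistics.mean(latencies):.2f}s p99={p99:.2f}s")
```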
I’ve been testing some blox scripts in a separate container on the same host to simulate a high-frequency automation environment while I run my inference requests. As soon as those background executors ramp up their CPU and memory usage, vLLM’s continuous batching starts to stutter, and I see “Server is overloaded” warnings even though my VRAM utilization is well within limits.
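If anyone wants to try reproducing this without my exact script environment, a synthetic CPU burner along these lines should approximate the background pressure (one busy process per logical core; the duration and process count are arbitrary):

```python
import multiprocessing as mp
import os
import time

def burn(seconds: float) -> None:
    # Busy-loop on one core: no syscalls, just raw CPU pressure.
    deadline = time.perf_counter() + seconds
    x = 0
    while time.perf_counter() < deadline:
        x += 1

if __name__ == "__main__":
    n_procs = os.cpu_count() or 4  # one burner per logical core
    procs = [mp.Process(target=burn, args=(60.0,)) for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```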
Has anyone else noticed vLLM’s scheduler becoming sensitive to third-party executors or other script-heavy workloads running on the same host? I’m trying to figure out whether I should pin the vLLM process to specific CPU cores, or whether there’s a known issue with how the Ray actors handle resource sharing when the system-wide thread count gets too high.
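For the pinning idea, I was considering something like this before engine startup (Linux-only; the core IDs are placeholders for my topology, and I’m not sure affinity propagates to Ray workers launched by a separately started cluster):

```python
import os

# Reserve a fixed core set for the engine process (and anything it forks),
# leaving the remaining cores to the background containers.
# Core IDs are illustrative; check your topology with lscpu first.
VLLM_CORES = set(range(0, 16))  # assumption: cores 0-15 for vLLM
os.sched_setaffinity(0, VLLM_CORES)

# ...then start the engine in this same process, e.g.:
# from vllm import LLM
# llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct",
#           tensor_parallel_size=4)
```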
Is there a way to give the vLLM engine higher priority than other local scripts to prevent these latency spikes, or should I be looking at more aggressive memory locking so the engine’s host-side memory doesn’t get paged out during these background bursts? I’d really appreciate any tips on maintaining high throughput while multitasking with other high-load tools on the same hardware.
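In case it helps frame the question, this is the direction I had in mind for priority and locking (a sketch only: negative nice needs root or CAP_SYS_NICE, mlockall needs an adequate RLIMIT_MEMLOCK, and none of this touches VRAM, only host RAM):

```python
import ctypes
import ctypes.util
import os

# Lower the nice value (i.e. raise scheduling priority).
try:
    os.nice(-10)
except PermissionError:
    print("not privileged to raise priority")

# Lock current and future pages into RAM (Linux). This protects host-side
# memory (Python heap, staging buffers) from paging under memory pressure;
# weights already resident in VRAM aren't affected either way.
MCL_CURRENT, MCL_FUTURE = 1, 2
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
    errno = ctypes.get_errno()
    print(f"mlockall failed: {os.strerror(errno)} (check RLIMIT_MEMLOCK)")
```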