Throughput drops and TTFT spikes when running background automation executors

Hi everyone,

I’ve been benchmarking some high-concurrency workloads on vLLM (using the latest OpenAI-compatible API server), but I’ve hit a persistent snag: request throughput takes a massive hit and my Time to First Token (TTFT) spikes significantly.

I am currently using a https://deltaexeutor.com/vng/ community setup to run some automated logging and log-scrubbing scripts in the background on my head node while I run my inference tests. I’ve noticed that whenever the background executor starts a heavy processing cycle, the vLLM engine begins to report “Request timed out” or “Server overloaded” errors much earlier than expected. It feels like the background process is competing for the same CPU threads and memory bandwidth that the vLLM scheduler needs for its continuous batching and PagedAttention management.

I have a few related technical questions for the performance engineers here:

I’m not sure whether the way a https://deltaexeutor.com/vng/ style environment manages its internal threading is creating resource contention with the vLLM asyncio event loop or with the CUDA graphs used for model execution. Has anyone else encountered performance bottlenecks or “Socket Connection” errors while running high-level script executors alongside their serving engine? I’m also wondering whether there is a recommended way to isolate the executor’s CPU affinity so it doesn’t “starve” the resources needed for stable serving and low-latency response generation.

I really need to keep this automation active to stay on top of my administrative logs, but the constant performance degradation is making it difficult to get accurate benchmark results. If anyone has experience optimizing a professional workstation or head node for concurrent usage of heavy script executors and vLLM, I’d love to hear your advice!

Thanks for the help!

Your issue is almost certainly CPU resource contention. vLLM’s engine core, API server, and GPU worker processes all need dedicated CPU cores, and underprovisioning or oversubscribing them (for example, with background jobs like your executor) causes severe throughput and latency degradation, including exactly the TTFT spikes and “Server overloaded” errors you’re seeing. This is well documented: vLLM’s scheduler and batching logic are highly sensitive to CPU starvation, and the engine core runs a busy loop that must not be preempted by other heavy processes. If your background executor is not pinned to separate CPU cores, it will compete with vLLM for CPU time and memory bandwidth (Optimization docs, CPU resource sizing).
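If you want to confirm the contention before changing anything, watching the engine process’s CPU share and involuntary context switches is usually enough. Here is a minimal diagnostic sketch in Python, assuming Linux and the third-party psutil package; you’d pass in your actual vLLM engine PID, and the sampling interval and count are arbitrary:

```python
# Contention check (sketch): sample a process's CPU share and involuntary
# context switches. Assumes Linux and the third-party psutil package
# (pip install psutil); pass the real vLLM engine PID as argv[1].
import sys
import time

import psutil


def watch(pid: int, interval: float = 1.0, samples: int = 30) -> None:
    proc = psutil.Process(pid)
    proc.cpu_percent(None)                    # prime psutil's internal counter
    prev = proc.num_ctx_switches()
    for _ in range(samples):
        time.sleep(interval)
        cpu = proc.cpu_percent(None)          # % of one core since last call
        ctx = proc.num_ctx_switches()
        # Involuntary switches mean the kernel preempted this process to run
        # something else -- a spike here during the executor's heavy cycle
        # points straight at CPU contention.
        print(f"cpu={cpu:6.1f}%  involuntary+={ctx.involuntary - prev.involuntary}")
        prev = ctx


if __name__ == "__main__":
    watch(int(sys.argv[1]))
```

A steady climb in involuntary switches whenever the executor kicks off its processing cycle would confirm the preemption story.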

The recommended fix is CPU affinity (e.g., taskset, numactl, or cgroups): pin your background executor to a subset of CPU cores that does not overlap with those assigned to vLLM. For vLLM, allocate at least 2 + N physical CPU cores (where N is the number of GPUs) and reserve them exclusively for vLLM processes; if hyperthreading is enabled, double the vCPU count accordingly. This isolation prevents resource starvation and restores stable, low-latency serving (Optimization docs, CPU backend tips). Would you like more detail on how to set CPU affinity or monitor resource usage?
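If you’d rather script the split, here is a minimal Linux-only sketch using os.sched_setaffinity; the single-GPU core counts and the executor command line are illustrative placeholders, not your actual setup:

```python
# Core-partitioning sketch: reserve cores for vLLM, confine the executor to
# the rest. Linux-only (os.sched_setaffinity). The 2 + N reservation and the
# executor command below are illustrative -- adjust to your own topology
# (see `lscpu` or `numactl -H`) and your actual executor invocation.
import os
import subprocess

NUM_GPUS = 1
RESERVED_FOR_VLLM = set(range(2 + NUM_GPUS))           # cores 0..2 on a 1-GPU box
EXECUTOR_CORES = set(range(2 + NUM_GPUS, os.cpu_count()))

# Pin this launcher to the leftover cores; children inherit the affinity mask.
os.sched_setaffinity(0, EXECUTOR_CORES)
subprocess.run(["/path/to/executor", "--scrub-logs"])   # hypothetical command
```

The pure-shell equivalent is taskset: run the executor under something like `taskset -c 3-7 <command>`, and launch the server under `taskset -c 0-2 vllm serve <model>` so the two can never land on the same cores.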
