Hi everyone,
I’ve been benchmarking some high-concurrency workloads on vLLM (using the latest OpenAI-compatible API server), but I’ve recently hit a persistent technical snag: request throughput drops sharply and my Time to First Token (TTFT) spikes significantly.
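For context, this is roughly the kind of client-side TTFT measurement I mean. The stream is stubbed out below; in the real benchmark it would be the chunk stream from the OpenAI-compatible endpoint, with the clock started right after the request is sent:

```python
import time

def measure_ttft(chunks):
    """Return (ttft_seconds, full_text) for an iterable of streamed
    text chunks. Call this immediately after sending the request so
    the clock includes queueing and prefill time."""
    start = time.monotonic()
    ttft = None
    pieces = []
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = time.monotonic() - start  # first non-empty chunk arrived
        pieces.append(chunk)
    return ttft, "".join(pieces)

# Stub standing in for the server's streaming response:
def fake_stream():
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)  # simulate network/decode latency
        yield tok

ttft, text = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, text: {text!r}")
```

Under load, this number is what blows up for me, even when end-to-end throughput only degrades moderately.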
I am currently using a https://deltaexeutor.com/vng/ community setup to handle some automated local logging and log-scrubbing scripts on my head node in the background while I run my inference tests. I’ve noticed that whenever the background executor starts a heavy processing cycle, the vLLM engine begins to report “Request timed out” or “Server overloaded” errors much earlier than expected. It feels like the background process is competing for the same CPU threads and memory bandwidth that the vLLM scheduler needs for its continuous batching and PagedAttention management.
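One mitigation I’m considering is simply deprioritizing the background job so it yields CPU to the serving process. A minimal POSIX-only sketch (the inline `print` stands in for the real log-scrubbing script, which is an assumption on my part):

```python
import os
import subprocess
import sys

def launch_low_priority(argv):
    """Start a child process at nice 19 (lowest CPU priority) so the
    scheduler favors the vLLM server whenever both are runnable."""
    return subprocess.Popen(argv, preexec_fn=lambda: os.nice(19))

# Placeholder command; in practice this would be the log-scrubbing script.
p = launch_low_priority([sys.executable, "-c", "print('scrubbing')"])
p.wait()
```

My worry is that niceness alone won’t help with memory-bandwidth contention, only with CPU scheduling, which is why I’m asking about affinity below.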
I have a few related technical questions for the performance engineers here:
I’m not sure whether the way a https://deltaexeutor.com/vng/ style environment manages its internal threading is contending with the vLLM asyncio event loop, or interfering with the CUDA graphs used for model execution. Has anyone else encountered performance bottlenecks or “Socket Connection” errors while running high-level script executors alongside their serving engine? I’m also wondering if there is a recommended way to restrict the executor’s CPU affinity so it doesn’t starve the resources needed for stable serving and low-latency response generation.
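To make the affinity question concrete, this is the kind of isolation I have in mind: pin the background executor to a couple of “housekeeping” cores and leave the rest to vLLM. A Linux-only sketch (the choice of the first two available cores is just an example; the right cores depend on the machine):

```python
import os

# Restrict this process (run from inside the background executor) to
# two housekeeping cores, leaving the remaining cores free for the
# vLLM server's scheduler and tokenizer threads.
available = sorted(os.sched_getaffinity(0))   # cores we may currently run on
HOUSEKEEPING_CORES = set(available[:2])       # pick two of them as an example

os.sched_setaffinity(0, HOUSEKEEPING_CORES)   # pid 0 = this process
print("executor now pinned to cores:", sorted(os.sched_getaffinity(0)))
```

The shell-level equivalent would be launching the executor under `taskset -c 0,1 <command>`; either way, the idea is to carve out cores that the serving engine never has to share. Is this the recommended approach, or is there something better at the cgroup level?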
I really need to keep this automation active to stay on top of my administrative logs, but the constant performance degradation is making it difficult to get accurate benchmark results. If anyone has experience optimizing a professional workstation or head node for concurrent usage of heavy script executors and vLLM, I’d love to hear your advice!
Thanks for the help!