V1 Engine child process dies unnoticed; check_health() is a no-op

Hi everyone,

I’ve been experimenting with the new V1 engine in vLLM because of its impressive speed and much lower memory usage compared to V0. I understand that V1 is still experimental, but these performance gains are exactly what I need. Unfortunately, there’s a blocker that’s making it effectively unusable for me right now.

What’s happening

When I spin up an AsyncLLMEngine, vLLM correctly launches a separate EngineCore subprocess. However, if that subprocess crashes (OOM, segfault, manual kill, etc.), the parent process never notices. The built-in health check is literally:

async def check_health(self) -> None:
    logger.debug("Called check_health.")

so it just returns immediately and keeps sending RPCs to a dead socket. No errors, no restart—just a silent hang.
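For reference, subprocess death is cheap to detect from the parent side; here is a minimal sketch of the kind of liveness check that `check_health()` could perform (plain `subprocess`, not vLLM's actual internals — the EngineCore stand-in is simulated with a sleeping child):

```python
import subprocess
import sys

def engine_core_alive(proc: subprocess.Popen) -> bool:
    # poll() returns None while the child is running, and its exit code
    # once it has died (reaping the zombie as a side effect).
    return proc.poll() is None

# Demo: a stand-in for the EngineCore subprocess that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
assert engine_core_alive(child)

# Simulate a hard crash; an OOM kill or segfault looks the same to the parent.
child.kill()
child.wait()
assert not engine_core_alive(child)
```

A real health check would also want to verify the RPC socket is responsive, since a hung child can be alive but unreachable.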

  • Is there an officially recommended way to health-check or automatically restart a dead V1 EngineCore?
  • Would it make sense for vLLM to expose a built-in watchdog or supervisor API for V1 engines?
  • Has anyone run V1 in production and found a more robust pattern for detecting and recovering from subprocess death?

Any pointers, examples, or links to ongoing discussion would be hugely appreciated. Thanks in advance!

As of now, there is no officially documented watchdog or supervisor API for detecting and restarting a dead V1 EngineCore in vLLM. The V1 engine's health check is minimal and does not detect subprocess death: if the EngineCore process dies (e.g., OOM, segfault), the parent may hang and keep sending RPCs to a dead socket without error or restart, as you described. This limitation is acknowledged in the community, and users have requested more reliable health checking and recovery, but no official solution currently exists in the documentation or codebase for V1 engines (see discussion, multiprocessing design doc).

Some users have implemented custom workarounds, such as monitoring the subprocess externally and restarting the parent process if the child dies, but these are not officially supported or standardized. There is ongoing discussion about improving this area, and related issues and PRs (e.g., #7633, #6594) have addressed similar problems for V0 and the OpenAI server, but not for V1’s EngineCore. Would you like more detail on possible custom patterns or code examples for external monitoring?
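For what it's worth, the external-monitoring workaround mentioned above usually boils down to a small supervisor loop. A hedged sketch (generic subprocess supervision, not a vLLM API — the command, poll interval, and restart limit here are illustrative):

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=3, poll_interval=0.1):
    """Respawn cmd whenever it exits, up to max_restarts times.

    Returns the number of restarts performed. A production supervisor
    would also log exit codes and back off between restarts.
    """
    restarts = 0
    proc = subprocess.Popen(cmd)
    try:
        while restarts < max_restarts:
            time.sleep(poll_interval)
            if proc.poll() is not None:  # child died; bring it back
                restarts += 1
                proc = subprocess.Popen(cmd)
        return restarts
    finally:
        proc.kill()
        proc.wait()

# Demo with a child that dies almost immediately.
n = supervise([sys.executable, "-c", "import time; time.sleep(0.05)"])
```

In practice many people push this out of Python entirely and let systemd, Kubernetes liveness probes, or a container restart policy play the supervisor role.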


Hey! Thanks for pointing this out. In the design of V1, we intended for issues in subprocesses to be bubbled up to the AsyncLLM automatically and for the server to shut down. If this is not happening, it is a bug. If you have a reliable reproduction of this, I would love to see it so we can fix it.

That being said, you are right that we should have a backup way to check health explicitly. If you open an issue with a feature request, we can track it. Thanks!

You are correct: in V1, the intended design is for subprocess failures (like EngineCore death) to propagate up to AsyncLLM, causing the server to shut down. If this is not happening, it is considered a bug, and the maintainers are interested in reliable reproduction steps to help fix it (see discussion). There is currently no explicit, robust backup health check for V1, and the maintainers encourage opening a feature request to track this need.
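Until the bug is fixed, the intended "failures bubble up, server shuts down" behavior can be approximated in user code with a watcher thread. A rough sketch (generic Python; the `on_death` callback is hypothetical and not how vLLM wires this internally):

```python
import subprocess
import sys
import threading

def propagate_death(proc, on_death):
    """Invoke on_death(exit_code) from a daemon thread once proc exits."""
    def _watch():
        on_death(proc.wait())
    t = threading.Thread(target=_watch, daemon=True)
    t.start()
    return t

# Demo: the child exits with code 7; the parent is notified and could
# shut the server down instead of just recording the event.
events = []
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(7)"])
watcher = propagate_death(child, events.append)
watcher.join(timeout=5)
```

In an asyncio setting the same idea can be expressed with `loop.run_in_executor` or `asyncio.create_subprocess_exec` plus an awaited `wait()`.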

Would you like more detail on how to reproduce the issue or on how to file an effective feature request?


Hi. I’ve opened the issue with code examples and my temporary workaround: [Bug]: Subprocess health check / automatic restart for V1 EngineCore · Issue #19849 · vllm-project/vllm · GitHub

Thank you for sharing your issue and workaround. Your GitHub issue (#19849) clearly documents the problem with V1 EngineCore subprocess death not being detected by the parent, and it will help the maintainers track and address this bug. There is no officially recommended fix yet, but your contribution is valuable for improving vLLM’s robustness.

Would you like a summary of your workaround or suggestions for further steps?
