Your vLLM instance is crashing with a TimeoutError in the multiprocessing executor. This is often caused by requests that are too large (e.g., very long prompts or outputs) or by resource exhaustion (e.g., not enough GPU memory for the requested batch/context size). The logs show a request with 83,338 prompt tokens and 47,734 output tokens, 131,072 tokens in total, which is extremely large and likely exceeds what the model and hardware can handle reliably, even on 8x H200 with --gpu-memory-utilization reduced to 0.7. A request of this size can cause the engine to hang or crash, producing the EngineDeadError and timeouts seen in your logs.
To mitigate this, reduce --max-model-len to cap the combined prompt and output length of a single request, and reduce --max-num-batched-tokens and/or --max-num-seqs to limit how many tokens and sequences are processed per scheduling step. Also consider enforcing stricter input validation on the client side so that overly large requests are rejected before they reach the server. For more troubleshooting strategies and memory conservation tips, see the official vLLM troubleshooting and memory documentation (troubleshooting, conserving memory). Would you like a step-by-step breakdown or specific parameter recommendations?
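As a rough illustration of the client-side validation suggested above, here is a minimal sketch in Python. The `RequestLimits` class, the `check_request` function, and the default caps are all hypothetical and not part of vLLM; pick values that fit your deployment and keep the total at or below the server's --max-model-len. It assumes you have already counted the prompt tokens with the model's tokenizer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLimits:
    """Hypothetical client-side caps; tune these to your own deployment."""
    max_prompt_tokens: int = 32_768
    max_output_tokens: int = 8_192
    # Should not exceed the --max-model-len the server was started with.
    max_total_tokens: int = 40_960


def check_request(
    prompt_tokens: int,
    max_tokens: int,
    limits: RequestLimits = RequestLimits(),
) -> list[str]:
    """Return the list of limit violations; an empty list means 'accept'."""
    problems: list[str] = []
    if prompt_tokens > limits.max_prompt_tokens:
        problems.append(
            f"prompt has {prompt_tokens} tokens, limit is {limits.max_prompt_tokens}"
        )
    if max_tokens > limits.max_output_tokens:
        problems.append(
            f"requested {max_tokens} output tokens, limit is {limits.max_output_tokens}"
        )
    total = prompt_tokens + max_tokens
    if total > limits.max_total_tokens:
        problems.append(
            f"prompt + output is {total} tokens, limit is {limits.max_total_tokens}"
        )
    return problems


if __name__ == "__main__":
    # The request from the logs: 83,338 prompt tokens and 47,734 output tokens.
    for problem in check_request(83_338, 47_734):
        print("rejected:", problem)
```

With these example caps, the request from your logs would be rejected on all three checks, while a typical request (say 1,000 prompt tokens and 512 output tokens) would pass. Rejecting on the client keeps a single oversized request from tying up the engine for everyone else.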