Yes, your approach is valid and aligns with current vLLM and NCCL behavior. When a rank (worker) crashes, NCCL collective operations on the surviving ranks either fail with an error or block indefinitely waiting for the dead peer. The recommended solution is to detect the dead worker, abort NCCL communication, and restart the vLLM engine so that the distributed environment is reinitialized. A full reinitialization is required because NCCL communicators cannot be reused after a fatal error or process death. This pattern is discussed in detail in vllm-project/vllm#24454 and related issues.
Recent RFCs and discussions also confirm that, upon unrecoverable failures (e.g., GPU or process death), the parent process should detect the failure, abort or rebuild NCCL communicators, and restart the engine or cluster as needed. This ensures that all ranks are synchronized and communication can resume safely. See vllm-project/vllm#30112 for a detailed breakdown of fault tolerance and recovery strategies in vLLM.
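As a rough illustration of the detect-and-restart pattern described above, here is a minimal, framework-agnostic sketch. It does not use vLLM's or NCCL's API: `run_with_restart`, `make_cmd`, and `demo_cmd` are hypothetical names, and the workers are plain Python subprocesses standing in for engine ranks. The assumption is that each rank runs in its own process, so killing the survivors discards their communicators and the next attempt starts from a clean state.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_with_restart(
    make_cmd: Callable[[int, int], Sequence[str]],
    world_size: int,
    max_restarts: int = 3,
    poll_interval: float = 0.05,
) -> int:
    """Launch one process per rank; if any rank dies, kill the rest and relaunch all.

    Returns the number of restarts that were needed before a clean run.
    """
    attempt = 0
    while True:
        procs = [
            subprocess.Popen(list(make_cmd(rank, attempt)))
            for rank in range(world_size)
        ]
        failed = False
        try:
            while True:
                codes = [p.poll() for p in procs]
                # A non-zero (or signal-induced negative) exit code means a dead rank.
                if any(c not in (None, 0) for c in codes):
                    failed = True
                    break
                if all(c == 0 for c in codes):
                    break
                time.sleep(poll_interval)
        finally:
            # Survivors may be stuck in a collective; do not wait for them to finish.
            for p in procs:
                if p.poll() is None:
                    p.kill()
                p.wait()
        if not failed:
            return attempt
        if attempt >= max_restarts:
            raise RuntimeError(f"workers still failing after {max_restarts} restarts")
        attempt += 1


def demo_cmd(rank: int, attempt: int) -> Sequence[str]:
    """Hypothetical workload: rank 1 crashes on the first attempt, rank 0 'hangs'."""
    if attempt == 0 and rank == 1:
        code = "import sys; sys.exit(1)"
    elif attempt == 0:
        code = "import time; time.sleep(60)"  # stands in for a blocked collective
    else:
        code = "pass"
    return [sys.executable, "-c", code]
```

Calling `run_with_restart(demo_cmd, world_size=2)` detects the crashed rank, kills the rank that is still blocked, and relaunches both. In a real deployment the command would start the engine instead of a toy script, and the restart policy (backoff, restart limit, whether to restart the whole cluster) would need to match your setup.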
Would you like more detail on how to implement this detection and restart logic?