However, the patch didn’t take effect: at runtime, the model still executed the original function, and I had to modify unified_triton_attention.py directly to make the change apply.
I understand this might be related to multi-processing, but I’m wondering: what is the correct way and timing to apply such patches?
Monkey-patching functions like unified_attention in vLLM is often ineffective when done in the main process, because vLLM launches worker processes (via multiprocessing or Ray) that import modules independently. A patch applied in the parent process does not propagate to child worker processes, so your patch in gpu_worker.py’s __init__ will not affect the function actually used inside those workers. This is a general limitation of Python multiprocessing: each process has its own memory space and performs its own module imports.
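The effect is easy to reproduce with nothing but the standard library (a generic sketch, not vLLM-specific, using json.dumps as a stand-in for the patched function): a patch applied in one process is invisible to a freshly started interpreter, which is essentially what a spawned worker is.

```python
# A monkey-patch lives only in the process that applied it. A fresh
# interpreter (like a spawned worker) re-imports the module and sees
# the original function.
import json
import subprocess
import sys

def patched_dumps(*args, **kwargs):
    return "patched!"

json.dumps = patched_dumps          # patch in the current (parent) process
in_parent = json.dumps({"a": 1})    # the patch is effective here

# A new interpreter, like a spawned worker, imports json from scratch:
in_child = subprocess.run(
    [sys.executable, "-c", "import json; print(json.dumps({'a': 1}))"],
    capture_output=True, text=True,
).stdout.strip()

print(in_parent)  # patched!
print(in_child)   # {"a": 1}  -- the parent's patch did not propagate
```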
The correct way is to ensure your patch is applied inside the worker process, before the function is first imported or used. This typically means placing the patch code at the very top of the worker module (e.g., at the start of vllm/v1/worker/gpu_worker.py), or creating a custom worker class/module and pointing the vLLM config at it, so the patch runs as part of the worker’s initialization. For more details, see the discussion in vllm/issues/6189 and the worker initialization logic in WorkerWrapperBase.init_worker.
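The "before the function is first imported or used" part matters because `from module import func` copies the name into the importer's namespace: patching the module attribute afterwards does not update names that were already bound. A minimal stdlib illustration (again using json.dumps as a stand-in for the real function):

```python
# Timing sketch: a name bound via `from module import ...` BEFORE the
# patch keeps pointing at the original function, so the patch must run
# before any caller performs that import.
import json
from json import dumps   # early binding, like a worker importing at startup

def patched(*args, **kwargs):
    return "patched"

json.dumps = patched     # the patch lands after the early binding above

late = json.dumps({})    # looked up through the module attribute -> patched
early = dumps({})        # the pre-patch binding -> original behavior ('{}')

print(late, early)
```

This is why, in the vLLM case, the patch belongs at the top of the worker module or in a custom worker class: it has to execute before anything in the worker binds a reference to the original unified_attention.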
Would you like a step-by-step example of how to implement this patching approach?