However, the patch didn’t take effect: at runtime, the model still executed the original function, and I had to modify unified_triton_attention.py directly to make the change apply.
I understand this might be related to multi-processing, but I’m wondering: what is the correct way and timing to apply such patches?
Monkey-patching functions like unified_attention in vLLM is often ineffective when done in the main process, because vLLM launches worker processes (via multiprocessing or Ray) that import modules independently. A patch applied in the parent process does not propagate to child worker processes, so your patch in gpu_worker.py’s __init__ will not affect the function actually used inside those workers. This is a general limitation of Python multiprocessing: each process has its own memory space and performs its own module imports.
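The effect is easy to reproduce with nothing but the standard library (a generic sketch, not vLLM-specific, using json.dumps as a stand-in for the patched function): a patch applied in one process is invisible to a freshly started interpreter, which is essentially what a spawned worker is.

```python
# A monkey-patch lives only in the process that applied it. A fresh
# interpreter (like a spawned worker) re-imports the module and sees
# the original function.
import json
import subprocess
import sys

def patched_dumps(*args, **kwargs):
    return "patched!"

json.dumps = patched_dumps          # patch in the current (parent) process
in_parent = json.dumps({"a": 1})    # the patch is effective here

# A new interpreter, like a spawned worker, imports json from scratch:
in_child = subprocess.run(
    [sys.executable, "-c", "import json; print(json.dumps({'a': 1}))"],
    capture_output=True, text=True,
).stdout.strip()

print(in_parent)  # patched!
print(in_child)   # {"a": 1}  -- the parent's patch did not propagate
```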
The correct way is to ensure your patch is applied inside the worker process, before the function is first imported or used. This typically means placing the patch code at the very top of the worker module (e.g., at the start of vllm/v1/worker/gpu_worker.py), or creating a custom worker class/module and pointing the vLLM config at it, so the patch runs as part of the worker’s initialization. For more details, see the discussion in vllm/issues/6189 and the worker initialization logic in WorkerWrapperBase.init_worker.
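The "before the function is first imported or used" part matters because `from module import func` copies the name into the importer's namespace: patching the module attribute afterwards does not update names that were already bound. A minimal stdlib illustration (again using json.dumps as a stand-in for the real function):

```python
# Timing sketch: a name bound via `from module import ...` BEFORE the
# patch keeps pointing at the original function, so the patch must run
# before any caller performs that import.
import json
from json import dumps   # early binding, like a worker importing at startup

def patched(*args, **kwargs):
    return "patched"

json.dumps = patched     # the patch lands after the early binding above

late = json.dumps({})    # looked up through the module attribute -> patched
early = dumps({})        # the pre-patch binding -> original behavior ('{}')

print(late, early)
```

This is why, in the vLLM case, the patch belongs at the top of the worker module or in a custom worker class: it has to execute before anything in the worker binds a reference to the original unified_attention.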
Would you like a step-by-step example of how to implement this patching approach?