Sampler.hip:564:63: error: local memory (66032) exceeds limit (65536) in 'void vllm::topKPerRowDecode<1024, true, false, true>'

I am trying to build vllm on wsl2 with rocm as specified here: GPU - vLLM, however my build stops with the following error:

```
/home/joefe/repos/vllm/build/temp.linux-x86_64-cpython-312/csrc/sampler.hip:564:63: error: local memory (66032) exceeds limit (65536) in 'void vllm::topKPerRowDecode<1024, true, false, true>(float const*, int const*, int*, int, int, int, int, float*, int, int const*)'
  564 | static __global__ __launch_bounds__(kNumThreadsPerBlock) void topKPerRowDecode(
      |                                                               ^
16 warnings and 1 error generated when compiling for gfx1201.
```

ROCm (6.4) and PyTorch (2.9) appear to be working fine.
I am on WSL2 with a Radeon RX 9070 (16 GB).

I have also tried nightly ROCm 7 and nightly PyTorch 2.11.

collect_env: collect_env - Pastebin.com

Is this a bug or some kind of environment setup issue? Thanks.

This looks like a build-time resource limit, not an environment misconfiguration. The compiler is reporting that the kernel `topKPerRowDecode`, instantiated with 1024 threads per block, needs 66032 bytes of local memory (LDS), which exceeds the 65536-byte (64 KiB) per-workgroup limit enforced when compiling for your GPU architecture (gfx1201). Failures like this can occur when large kernels are compiled for certain AMD GPUs, especially with high thread counts or large shared-memory allocations, and are not specific to WSL2 or your ROCm version. The official vLLM documentation does not describe a workaround for this particular error, though similar build failures are discussed in the context of resource limits and kernel compilation (docs.vllm.ai ROCm install, github_issue #10855).
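To make the numbers concrete, here is a back-of-envelope sketch in plain Python. The per-thread byte figure is inferred from the error message, not read from the kernel source, and it assumes the kernel's local-memory footprint scales linearly with block size, which may not hold for the real implementation:

```python
# gfx1201 (RDNA 4) allows at most 64 KiB of LDS (local memory) per workgroup.
LDS_LIMIT = 64 * 1024  # 65536 bytes

def lds_footprint(threads_per_block: int, bytes_per_thread: float) -> float:
    """Hypothetical model: per-block local memory scales with block size."""
    return threads_per_block * bytes_per_thread

# The compiler reports 66032 bytes for the <1024, ...> instantiation,
# i.e. about 64.5 bytes of scratch per thread (inferred, not from source).
per_thread = 66032 / 1024

print(lds_footprint(1024, per_thread))  # 66032.0 -> over the 65536 limit
print(lds_footprint(512, per_thread))   # 33016.0 -> well under the limit
```

Under that (assumed) linear model, halving the threads per block would bring the footprint comfortably back under the 64 KiB limit, which is why reducing the block size is the first workaround worth trying.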

Possible workarounds are patching the kernel code to reduce its shared/local memory usage, lowering the number of threads per block, or waiting for an upstream fix. You could also try building on native Linux or with a different ROCm version, though neither is guaranteed to change the kernel's resource usage. Would you like more detail on how to attempt a workaround or patch the kernel code?
