BG: I customized a model based on the original Qwen2 model by monkey-patching some of its forward functions. I have now run into a problem: my vLLM output is quite different from the HuggingFace output, and I was not able to find the bug just by reviewing the code. I would like to debug it by checking the intermediate hidden states of each layer, but that does not seem straightforward in vLLM (in HuggingFace, output_hidden_states=True works).
Does anybody know an effective way to debug such a case? Thanks for responding XD
vLLM does not natively support returning intermediate hidden states for each layer in the same way as HuggingFace’s output_hidden_states=True option. This limitation is due to performance concerns and architectural choices; vLLM is optimized for fast inference and does not accumulate or expose per-layer hidden states by default. There have been community discussions and PRs proposing this feature, but maintainers have not merged such functionality into mainline vLLM, and recommend using HuggingFace Transformers for this use case instead. If you need to debug by inspecting hidden states, you may need to modify vLLM’s model code directly (e.g., by adding hooks or custom print statements), or temporarily switch to HuggingFace for debugging purposes. See the discussion in vllm-project/vllm#20624 and vllm-project/vllm#6165.
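As a rough illustration of the hook-based approach, here is a minimal sketch using standard PyTorch forward hooks (`register_forward_hook`) on a toy stack of layers. The helper names `capture_layer_outputs` and `first_divergent_layer` are hypothetical, and the toy `nn.Sequential` models only stand in for the two implementations. How you reach the underlying `nn.Module` and its decoder layers inside vLLM depends on the vLLM version and is not shown here. You will probably also need `enforce_eager=True`, since Python hooks are not expected to fire when a captured CUDA graph is replayed.

```python
import torch
from torch import nn


def capture_layer_outputs(layers, run_forward):
    """Run `run_forward()` with a forward hook on each layer and
    return the captured outputs in layer order."""
    captured = {}
    handles = []

    def make_hook(idx):
        def hook(module, args, output):
            # Decoder layers often return a tuple; keep the first element.
            hidden = output[0] if isinstance(output, tuple) else output
            captured[idx] = hidden.detach().float().cpu()
        return hook

    for idx, layer in enumerate(layers):
        handles.append(layer.register_forward_hook(make_hook(idx)))
    try:
        with torch.no_grad():
            run_forward()
    finally:
        # Always remove the hooks so later runs are unaffected.
        for handle in handles:
            handle.remove()
    return [captured[i] for i in range(len(captured))]


def first_divergent_layer(ref, test, atol=1e-3):
    """Return (layer_index, max_abs_diff) for the first layer whose
    outputs differ by more than `atol`, or None if all layers match."""
    for idx, (a, b) in enumerate(zip(ref, test)):
        diff = (a - b).abs().max().item()
        if diff > atol:
            return idx, diff
    return None


# Toy stand-ins: two identical 4-layer stacks, with a simulated bug in layer 2.
torch.manual_seed(0)
reference = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
patched = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
patched.load_state_dict(reference.state_dict())
with torch.no_grad():
    patched[2].bias.add_(0.5)

x = torch.randn(3, 8)
ref_states = capture_layer_outputs(reference, lambda: reference(x))
test_states = capture_layer_outputs(patched, lambda: patched(x))
result = first_divergent_layer(ref_states, test_states)
print(result)
```

On the HuggingFace side you can skip the hooks and use the tuple returned with output_hidden_states=True. Two caveats when comparing against vLLM: some vLLM model implementations return the hidden states and the residual separately from each decoder layer, so you may need to sum them first, and vLLM typically processes a flattened batch of tokens, so the shapes may not line up with HuggingFace without reshaping. Small numerical differences from fused kernels and reduced precision are expected, so pick the tolerance accordingly.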
Would you like more detail on how to patch vLLM to print or extract hidden states for debugging? Or do you want advice on alternative debugging strategies?