There has been a lot of recent work on trying to use the hidden state of an LLM as an input to classifiers for identifying things like harmful content and jailbreaks (like Anthropic’s two-stage classifier approach). This kind of system might be favourable compared to what Iris is trying to do for jailbreak detection. Are there any plans to allow the easy integration of probes on hidden states as part of the main vLLM engine without having to specially patch it?
Currently, vLLM does not provide a built-in, officially supported API for running probes or classifiers on hidden states inside the main engine. There is strong community interest, and several RFCs and PRs have discussed exposing hidden states for downstream tasks such as classification and interpretability, but the maintainers have not merged these features into core because of concerns about performance, complexity, and multi-tenant stability. Most existing solutions rely on out-of-tree patches, plugins, or custom modifications, and there is no clear roadmap yet for first-class support in the mainline engine. For example, proposals such as a HiddenStatesProcessor or a plugin system have been discussed but not adopted as default features (RFC: Hidden States Extraction, Hidden states processor RFC, PR: Hidden State Processors via plugins).
If you need this functionality now, you would need to use community plugins, forked patches, or custom hooks, as described in the RFCs and issues. The maintainers have indicated that such features may remain opt-in or out-of-tree due to their impact on performance and resource usage (Hidden States Extraction RFC, FAQ on hidden states). Would you like more detail on the current workarounds or community plugins for extracting and probing hidden states in vLLM?
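To make the workaround concrete: the patches and hooks described above typically pair a PyTorch forward hook (to capture a layer's hidden states) with a small linear probe trained separately. The sketch below illustrates that pattern on a toy model standing in for the real LLM; the module sizes, the hook target, and the `probe` classifier are all illustrative assumptions, since vLLM itself exposes no such hook today and a real patch would attach to its internal model runner instead.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM's decoder stack (illustrative only; a real
# out-of-tree patch would hook vLLM's internal model, not this module).
model = nn.Sequential(
    nn.Embedding(100, 32),
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
)

captured = {}

def capture_hidden(module, inputs, output):
    # Save the hidden states leaving the first transformer layer.
    captured["hidden"] = output.detach()

# This is the "patch": a standard PyTorch forward hook on one layer.
model[1].register_forward_hook(capture_hidden)

# A linear probe mapping pooled hidden states to two classes
# (e.g. benign vs. jailbreak); in practice this would be pre-trained.
probe = nn.Linear(32, 2)

tokens = torch.randint(0, 100, (1, 8))  # batch of 1, sequence of 8 tokens
with torch.no_grad():
    model(tokens)
    # Mean-pool over the sequence dimension, then classify.
    logits = probe(captured["hidden"].mean(dim=1))

print(captured["hidden"].shape)  # torch.Size([1, 8, 32])
print(logits.shape)              # torch.Size([1, 2])
```

The hook runs inline with the forward pass, which is exactly why the maintainers flag performance and multi-tenant stability concerns: every request pays the capture and probe cost unless the feature is opt-in.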
Sources:
- RFC: Hidden States Extraction
- Hidden states processor RFC
- PR: Hidden State Processors via plugins
- FAQ on hidden states
Hint: Mention @RunLLM in the post for followups.