There has been a lot of recent work on trying to use the hidden state of an LLM as an input to classifiers for identifying things like harmful content and jailbreaks (like Anthropic’s two-stage classifier approach). This kind of system might be favourable compared to what Iris is trying to do for jailbreak detection. Are there any plans to allow the easy integration of probes on hidden states as part of the main vLLM engine without having to specially patch it?
Currently, vLLM does not provide a built-in, officially supported API for running probes or classifiers on hidden states inside the main engine. There is strong community interest, and several RFCs and PRs have discussed exposing hidden states for downstream tasks such as classification and interpretability, but the maintainers have not merged these features into core because of concerns about performance, complexity, and multi-tenant stability. Most existing solutions rely on out-of-tree patches, plugins, or custom modifications, and there is no clear roadmap yet for first-class support in the mainline engine. For example, proposals such as a HiddenStatesProcessor or a plugin system have been discussed but not adopted as default features (RFC: Hidden States Extraction, Hidden states processor RFC, PR: Hidden State Processors via plugins).
If you need this functionality now, you would need to use community plugins, forked patches, or custom hooks, as described in the RFCs and issues. The maintainers have indicated that such features may remain opt-in or out-of-tree due to their impact on performance and resource usage (Hidden States Extraction RFC, FAQ on hidden states). Would you like more detail on the current workarounds or community plugins for extracting and probing hidden states in vLLM?
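To make the workaround concrete: the patches and hooks described above typically pair a PyTorch forward hook (to capture a layer's hidden states) with a small linear probe trained separately. The sketch below illustrates that pattern on a toy model standing in for the real LLM; the module sizes, the hook target, and the `probe` classifier are all illustrative assumptions, since vLLM itself exposes no such hook today and a real patch would attach to its internal model runner instead.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM's decoder stack (illustrative only; a real
# out-of-tree patch would hook vLLM's internal model, not this module).
model = nn.Sequential(
    nn.Embedding(100, 32),
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
)

captured = {}

def capture_hidden(module, inputs, output):
    # Save the hidden states leaving the first transformer layer.
    captured["hidden"] = output.detach()

# This is the "patch": a standard PyTorch forward hook on one layer.
model[1].register_forward_hook(capture_hidden)

# A linear probe mapping pooled hidden states to two classes
# (e.g. benign vs. jailbreak); in practice this would be pre-trained.
probe = nn.Linear(32, 2)

tokens = torch.randint(0, 100, (1, 8))  # batch of 1, sequence of 8 tokens
with torch.no_grad():
    model(tokens)
    # Mean-pool over the sequence dimension, then classify.
    logits = probe(captured["hidden"].mean(dim=1))

print(captured["hidden"].shape)  # torch.Size([1, 8, 32])
print(logits.shape)              # torch.Size([1, 2])
```

The hook runs inline with the forward pass, which is exactly why the maintainers flag performance and multi-tenant stability concerns: every request pays the capture and probe cost unless the feature is opt-in.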
Sources:
- RFC: Hidden States Extraction
- Hidden states processor RFC
- PR: Hidden State Processors via plugins
- FAQ on hidden states
Hint: Mention @RunLLM in the post for followups.