Is there a hook/flag to capture activation statistics during inference for use with llm-compressor AWQ?

Context: AWQ quantization via llm-compressor requires activation statistics (per-channel magnitudes) collected during calibration. Typically this uses a small proxy dataset, which may not represent real production traffic.

What I want to do: Collect the activation statistics that AWQ needs during live inference on the full-precision bf16/fp16 model (real customer traffic), then feed those stats into llm-compressor for AWQ quantization, giving a more accurate, production-representative result.

Question: Does vLLM currently expose any option, flag, or hook to capture per-layer/per-channel activation statistics during inference?

It seems likely that vLLM does not currently provide a built-in option, flag, or official API to directly capture per-layer or per-channel activation statistics during inference for use in AWQ quantization calibration. The recommended workflow for AWQ quantization in vLLM relies on calibration with a proxy dataset, and collecting activation statistics from live inference is not explicitly documented as a supported feature. The documentation and issues suggest that custom hooks or source code modifications are required to access activations, and there is no mention of a standardized interface for this purpose in the quantization or AWQ documentation (AutoAWQ — vLLM, AWQConfig API).

Some users have discussed using PyTorch forward hooks or modifying the model to capture activations, but these approaches are not officially supported or exposed as part of the vLLM API (github_issue_26147, github_issue_4084). You may need to implement custom hooks or modify the vLLM model runner to collect the required statistics during inference. Would you like more detail on how to implement such hooks or where in the vLLM codebase you might insert them?
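As a starting point, the forward-hook idea can be sketched in plain PyTorch. This is a minimal illustration, not a vLLM or llm-compressor API: `ChannelStats` and `attach_stat_hooks` are hypothetical names, and the sketch assumes the layers of interest are `torch.nn.Linear`. vLLM uses its own parallel linear layer classes and runs the model inside worker processes, often with CUDA graphs, so the `isinstance` check would need adapting and the hooks would likely only fire in eager mode (for example with `enforce_eager=True`).

```python
import torch
from torch import nn


class ChannelStats:
    """Running per-input-channel statistics for one linear layer (hypothetical helper)."""

    def __init__(self, num_channels: int):
        self.abs_sum = torch.zeros(num_channels, dtype=torch.float64)
        self.abs_max = torch.zeros(num_channels, dtype=torch.float64)
        self.count = 0  # number of token vectors seen so far

    @torch.no_grad()
    def update(self, x: torch.Tensor) -> None:
        # Flatten all leading dims: rows are tokens, columns are channels.
        flat = x.detach().reshape(-1, x.shape[-1]).abs()
        flat = flat.to(device="cpu", dtype=torch.float64)
        if flat.shape[0] == 0:
            return
        # Only the aggregates are kept; the activations themselves are discarded.
        self.abs_sum += flat.sum(dim=0)
        self.abs_max = torch.maximum(self.abs_max, flat.amax(dim=0))
        self.count += flat.shape[0]

    def mean_abs(self) -> torch.Tensor:
        return self.abs_sum / max(self.count, 1)


def attach_stat_hooks(model: nn.Module):
    """Register a forward pre-hook on every nn.Linear; return (stats, handles)."""
    stats, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            stats[name] = ChannelStats(module.in_features)

            def pre_hook(mod, args, _name=name):
                # args[0] is the input activation of the linear layer.
                stats[_name].update(args[0])

            handles.append(module.register_forward_pre_hook(pre_hook))
    return stats, handles


if __name__ == "__main__":
    toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    stats, handles = attach_stat_hooks(toy)
    toy(torch.randn(3, 5, 4))
    for layer_name, s in stats.items():
        print(layer_name, s.count, s.mean_abs())
    for h in handles:
        h.remove()  # detach the hooks once collection is done
```

Copying each activation to the CPU keeps the sketch simple but would add latency on a serving path; accumulating on the device and transferring periodically is one alternative. How such statistics would then be passed to llm-compressor is not covered here.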

Hi @ManikandanB, can’t you log output tokens in vLLM and use them as your calibration dataset in llm-compressor? The pipeline regenerates hidden states layer by layer, so you can use your production dataset without having to save the hidden states for every layer. Feel free to raise this as an issue on llm-compressor and we can follow up there. We don’t typically monitor this discussion board.

Thanks. The challenge in my case is that we are not permitted to retain production prompts/completions or customer data for later replay.

What I was hoping to collect are only aggregate activation statistics (e.g., per-channel running maxima, mean absolute values, histograms, percentile estimates, etc.) during inference on the FP16/BF16 model, without storing any customer inputs, outputs, or hidden states.
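To make "aggregate only" concrete, here is a rough sketch of one such summary: a fixed-size histogram of activation magnitudes that supports coarse percentile estimates and can be merged across workers. It is pure Python for illustration, `AbsHistogram` is a hypothetical name rather than anything in vLLM or llm-compressor, and the power-of-two bucket layout is an assumption. One instance per channel would be needed for per-channel statistics.

```python
import math


class AbsHistogram:
    """Histogram of |x| with power-of-two bucket edges (hypothetical helper).

    Bucket i counts values with 2**(min_exp + i) <= |x| < 2**(min_exp + i + 1).
    Values outside the range are clamped into the first or last bucket, so
    only counts are stored and no raw activation value is retained.
    """

    def __init__(self, min_exp: int = -16, max_exp: int = 16):
        self.min_exp = min_exp
        self.max_exp = max_exp
        self.counts = [0] * (max_exp - min_exp)
        self.total = 0

    def add(self, value: float) -> None:
        mag = abs(value)
        if mag == 0.0:
            idx = 0
        else:
            # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
            exp = math.frexp(mag)[1] - 1
            idx = min(max(exp - self.min_exp, 0), len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def merge(self, other: "AbsHistogram") -> None:
        """Combine counts from another worker that uses the same bucket layout."""
        if (self.min_exp, self.max_exp) != (other.min_exp, other.max_exp):
            raise ValueError("bucket layouts differ")
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]
        self.total += other.total

    def percentile_upper_bound(self, q: float) -> float:
        """Upper edge of the bucket containing the q-th percentile (0 < q <= 100)."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        target = math.ceil(self.total * q / 100)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 2.0 ** (self.min_exp + i + 1)
        return 2.0 ** self.max_exp


if __name__ == "__main__":
    hist = AbsHistogram()
    for v in [0.5, -1.5, 3.0, 100.0]:
        hist.add(v)
    print(hist.percentile_upper_bound(50))   # upper edge of the median's bucket
    print(hist.percentile_upper_bound(100))  # upper edge of the largest bucket used
```

The estimate is only accurate to within a factor of two, and clamped values make the outermost buckets unreliable; finer buckets trade memory for resolution.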

If vLLM currently doesn’t expose any hook or extension point that would allow accumulation of such activation summaries during inference, would this be something better discussed as a feature request in vLLM or llm-compressor?