Is there a hook/flag to capture activation statistics during inference for use with llm-compressor AWQ?

ManikandanB · June 4, 2026, 1:49am

Context: AWQ quantization via llm-compressor requires activation statistics (per-channel magnitudes) collected during calibration. Typically this uses a small proxy dataset, which may not represent real production traffic.

What I want to do: Collect activation-aware weight statistics during live inference on the full-precision BF16/FP16 model (real customer traffic) and feed those stats into llm-compressor for AWQ quantization, giving a more accurate, production-representative result.

Question: Does vLLM currently expose any option, flag, or hook to capture per-layer/per-channel activation statistics during inference?

RunLLM · June 4, 2026, 1:16pm

It seems likely that vLLM does not currently provide a built-in option, flag, or official API to directly capture per-layer or per-channel activation statistics during inference for use in AWQ quantization calibration. The recommended workflow for AWQ quantization in vLLM relies on calibration with a proxy dataset, and collecting activation statistics from live inference is not explicitly documented as a supported feature. The documentation and issues suggest that custom hooks or source code modifications are required to access activations, and there is no mention of a standardized interface for this purpose in the quantization or AWQ documentation (AutoAWQ — vLLM, AWQConfig API).

Some users have discussed using PyTorch forward hooks or modifying the model to capture activations, but these approaches are not officially supported or exposed as part of the vLLM API (github_issue_26147, github_issue_4084). You may need to implement custom hooks or modify the vLLM model runner to collect the required statistics during inference. Would you like more detail on how to implement such hooks or where in the vLLM codebase you might insert them?
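
For reference, a minimal sketch of the PyTorch forward-hook idea mentioned above, run on a toy `nn.Sequential` model rather than on vLLM itself. `ChannelStats` and `attach_stat_hooks` are hypothetical names introduced only for this illustration. The sketch assumes the target layers are plain `nn.Linear` modules and that hooks fire on every forward pass; vLLM's own layer classes and execution modes may not satisfy either assumption, so this is not a supported vLLM mechanism.

```python
import torch
from torch import nn


class ChannelStats:
    """Running per-input-channel statistics; no activations are stored."""

    def __init__(self, num_channels: int):
        self.abs_sum = torch.zeros(num_channels, dtype=torch.float64)
        self.abs_max = torch.zeros(num_channels, dtype=torch.float64)
        self.count = 0  # number of token rows seen so far

    @torch.no_grad()
    def update(self, x: torch.Tensor) -> None:
        # Flatten (batch, seq, hidden) or (tokens, hidden) into rows of channels.
        flat = x.detach().reshape(-1, x.shape[-1]).abs()
        flat = flat.to(device="cpu", dtype=torch.float64)
        if flat.shape[0] == 0:
            return
        self.abs_sum += flat.sum(dim=0)
        self.abs_max = torch.maximum(self.abs_max, flat.amax(dim=0))
        self.count += flat.shape[0]

    def mean_abs(self) -> torch.Tensor:
        return self.abs_sum / max(self.count, 1)


def attach_stat_hooks(model: nn.Module):
    """Register a forward pre-hook on every nn.Linear (assumed layer type)."""
    stats, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            layer_stats = ChannelStats(module.in_features)
            stats[name] = layer_stats

            def pre_hook(mod, args, _s=layer_stats):
                _s.update(args[0])  # args[0] is the layer's input tensor

            handles.append(module.register_forward_pre_hook(pre_hook))
    return stats, handles


# Toy demonstration on a stand-in model, not a vLLM model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
stats, handles = attach_stat_hooks(model)
with torch.no_grad():
    model(torch.ones(3, 4))
    model(2 * torch.ones(5, 4))
for handle in handles:
    handle.remove()  # detach hooks once collection is finished
```

Only the per-channel sums, maxima, and a row count survive each call. Copying every layer input to the CPU on each forward pass is expensive, so a real deployment would likely accumulate on the device and sample only a fraction of requests.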

brian-dellabetta · June 4, 2026, 1:55pm

Hi @ManikandanB, can’t you log output tokens in vLLM and use them as your calibration dataset in llm-compressor? The pipeline regenerates hidden states layer by layer, so you can use your production dataset without having to save the hidden states for every layer. Feel free to raise this as an issue on llm-compressor and we can follow up there. We don’t typically monitor this discussion board.

ManikandanB · June 4, 2026, 5:43pm

Thanks. The challenge in my case is that we are not permitted to retain production prompts/completions or customer data for later replay.

What I was hoping to collect are only aggregate activation statistics (e.g., per-channel running maxima, mean absolute values, histograms, percentile estimates, etc.) during inference on the FP16/BF16 model, without storing any customer inputs, outputs, or hidden states.
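
To make the kind of summary concrete, here is a rough pure-Python sketch of such an accumulator. `RunningChannelStats`, its fixed histogram bucket edges, and the `merge` method are all hypothetical and chosen only for illustration. It keeps a running maximum, a sum of absolute values, and a coarse histogram per channel, and it can merge the summaries from several workers without any of them retaining individual activations.

```python
import bisect


class RunningChannelStats:
    """Aggregate per-channel summaries; individual rows are never kept."""

    def __init__(self, num_channels, hist_edges=(0.5, 1.0, 2.0, 4.0, 8.0)):
        self.num_channels = num_channels
        self.hist_edges = list(hist_edges)  # illustrative bucket boundaries
        self.count = 0
        self.abs_sum = [0.0] * num_channels
        self.abs_max = [0.0] * num_channels
        # One bucket below the first edge, one above the last edge.
        self.hist = [[0] * (len(self.hist_edges) + 1) for _ in range(num_channels)]

    def update(self, row):
        """Fold one activation row (one value per channel) into the summary."""
        if len(row) != self.num_channels:
            raise ValueError("row length does not match num_channels")
        self.count += 1
        for c, value in enumerate(row):
            a = abs(value)
            self.abs_sum[c] += a
            self.abs_max[c] = max(self.abs_max[c], a)
            self.hist[c][bisect.bisect_right(self.hist_edges, a)] += 1

    def merge(self, other):
        """Combine another worker's summary into this one."""
        if other.hist_edges != self.hist_edges or other.num_channels != self.num_channels:
            raise ValueError("summaries are not compatible")
        self.count += other.count
        for c in range(self.num_channels):
            self.abs_sum[c] += other.abs_sum[c]
            self.abs_max[c] = max(self.abs_max[c], other.abs_max[c])
            for b, n in enumerate(other.hist[c]):
                self.hist[c][b] += n

    def mean_abs(self):
        if self.count == 0:
            return [0.0] * self.num_channels
        return [s / self.count for s in self.abs_sum]


# Two workers each see different traffic, then their summaries are merged.
worker_a = RunningChannelStats(2)
worker_a.update([0.25, -3.0])
worker_a.update([-1.5, 1.0])
worker_b = RunningChannelStats(2)
worker_b.update([0.75, 9.0])
worker_a.merge(worker_b)
```

Percentile estimates could be approximated from the histogram counts. Whether llm-compressor's AWQ flow can consume summaries like these in place of calibration samples is exactly the open question.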

If vLLM currently doesn’t expose any hook or extension point that would allow accumulation of such activation summaries during inference, would this be something better discussed as a feature request in vLLM or llm-compressor?
