Running INT4 models on CPU

phalani-paladugu · August 8, 2025, 8:53pm

Hi, I’m attempting to run the RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 model on CPU using vLLM v0.10.0. I’m hitting a TorchDynamo-related error that I didn’t see in earlier versions, possibly because Dynamo wasn’t enabled before?

The model runs successfully when I set TORCHDYNAMO_DISABLE=1. Are there any performance trade-offs or limitations associated with disabling Dynamo in this context?

ERROR 08-08 20:12:53 [core.py:632]   File "/home/ubuntu/miniforge3/lib/python3.12/site-packages/torch/_dynamo/variables/user_defined.py", line 1095, in var_getattr
ERROR 08-08 20:12:53 [core.py:632]     unimplemented("UserDefined with non-function __getattr__")
ERROR 08-08 20:12:53 [core.py:632]   File "/home/ubuntu/miniforge3/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 317, in unimplemented
ERROR 08-08 20:12:53 [core.py:632]     raise Unsupported(msg, case_name=case_name)
ERROR 08-08 20:12:53 [core.py:632] torch._dynamo.exc.Unsupported: UserDefined with non-function __getattr__

RunLLM · August 8, 2025, 8:54pm

It seems likely that setting TORCHDYNAMO_DISABLE=1 makes vLLM run in eager mode: with TorchDynamo disabled, torch.compile no longer captures or optimizes the model, so you lose the optimizations it would otherwise provide. On CPU, this may result in slower inference, as vLLM relies on torch.compile to optimize model execution for better throughput and latency. However, if your model only works with Dynamo disabled, this workaround is necessary for now. The error you see ("UserDefined with non-function __getattr__") means TorchDynamo hit an attribute lookup on a user-defined object that it cannot trace. This can occur with some quantized or custom models on newer vLLM versions, where Dynamo is enabled by default for CPU as well as GPU backends.

My understanding is that the only cost of disabling Dynamo is this performance trade-off: the model runs, but possibly slower than a successful run with Dynamo enabled. There are no other functional limitations, and this approach is sometimes recommended as a workaround for models that hit Dynamo incompatibilities. Would you like more detail on how to potentially resolve the Dynamo error or on the performance impact?
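For reference, here is a minimal sketch of applying the workaround from Python rather than from the shell. It assumes the variable has to be set before torch (and therefore vLLM) is first imported, since, as far as I know, TorchDynamo reads it when its configuration loads. The `generate` helper is a hypothetical name for illustration, and the model is the one from the question.

```python
import os

# Assumption: this must be set before torch/vllm are imported, because
# TorchDynamo is expected to read the variable when its config loads.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"


def generate(prompt: str, max_tokens: int = 64) -> str:
    """Hypothetical helper: load the model with vLLM and return one completion."""
    # Imported lazily so the environment variable above is already in place.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL)
    params = SamplingParams(max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `generate("Hello")` on a machine with a CPU build of vLLM should then run without Dynamo. Exporting TORCHDYNAMO_DISABLE=1 in the shell before launching Python, as in the original post, has the same effect and avoids any import-order concerns.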
