Running INT4 models on CPU

Hi, I’m attempting to run the RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 model on CPU using vLLM v0.10.0. I’m encountering a TorchDynamo-related error that I hadn’t seen in earlier versions, possibly because Dynamo wasn’t enabled in those versions.

The model runs successfully when I set TORCHDYNAMO_DISABLE=1. Are there any performance trade-offs or limitations associated with disabling Dynamo in this context?

ERROR 08-08 20:12:53 [core.py:632]   File "/home/ubuntu/miniforge3/lib/python3.12/site-packages/torch/_dynamo/variables/user_defined.py", line 1095, in var_getattr
ERROR 08-08 20:12:53 [core.py:632]     unimplemented("UserDefined with non-function __getattr__")
ERROR 08-08 20:12:53 [core.py:632]   File "/home/ubuntu/miniforge3/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 317, in unimplemented
ERROR 08-08 20:12:53 [core.py:632]     raise Unsupported(msg, case_name=case_name)
ERROR 08-08 20:12:53 [core.py:632] torch._dynamo.exc.Unsupported: UserDefined with non-function __getattr__

It seems likely that setting TORCHDYNAMO_DISABLE=1 makes vLLM run in eager mode, which means you lose the optimizations that torch.compile (built on TorchDynamo) would otherwise apply. On CPU this may result in slower inference, since vLLM relies on torch.compile to improve throughput and latency. However, if your model only works with Dynamo disabled, the workaround is necessary for now. The error itself comes from TorchDynamo hitting a user-defined object whose __getattr__ is not a plain function, a pattern it cannot trace; this can occur with some quantized or custom models on newer vLLM versions where Dynamo is enabled by default for CPU as well as GPU backends.
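As a minimal sketch of the workaround, the flag can also be set from Python rather than the shell, as long as that happens before torch or vLLM is imported (torch reads the variable when its Dynamo config module loads). The helper name `run_prompt` and the `max_tokens` value are illustrative assumptions, not part of vLLM; `LLM`, `SamplingParams`, and `generate` are the standard vLLM offline-inference entry points.

```python
import os

# Set the flag before torch or vLLM is imported; torch reads it at import time.
os.environ["TORCHDYNAMO_DISABLE"] = "1"


def run_prompt(
    prompt: str,
    model: str = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
) -> str:
    """Hypothetical helper: generate a short completion with Dynamo disabled."""
    # Imported lazily so the environment variable above is already in place.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
    # One prompt in, so take the first request's first completion.
    return outputs[0].outputs[0].text
```

Calling `run_prompt("Hello, my name is")` once with the flag set, and once in a fresh process without it on a model that does compile, would be one way to measure the actual performance gap on your CPU.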

My understanding is that disabling Dynamo avoids the error and lets the model run, at the cost of reduced performance compared to a successful run with Dynamo enabled. I'm not aware of any functional limitations beyond this performance trade-off, and the approach is sometimes recommended as a workaround for models that hit Dynamo incompatibilities. Would you like more detail on how to potentially resolve the Dynamo error or on the performance impact?
