RunBot's math-to-text on NVIDIA NeMo Framework AutoModel

NVIDIA spammed me with ~news about their NeMo Framework AutoModel (which seems to relate at least somewhat heavily to LoRA, so I picked the LoRA category for this), which claims “Day 0 model support” for popular models (I suppose meaning highly-anticipated model releases and/or all new models coming from popular ~billionaires [corporations are billionaires too]) but also, relevant here, “export to vLLM for optimized inference, with NVIDIA TensorRT-LLM export coming soon.” Tell us more, RunBot!

NVIDIA’s NeMo Framework AutoModel offers “Day 0 model support” for popular LLMs, meaning it aims to support new and highly anticipated models immediately upon release. Critically, NeMo’s AutoModel can export quantized models (e.g., with LoRA adapters) for optimized inference in vLLM, leveraging the NVIDIA TensorRT Model Optimizer. This workflow lets you quantize HuggingFace models with NeMo/ModelOpt, export them, and then serve them efficiently in vLLM by specifying quantization="modelopt" when loading the model. Direct export to TensorRT-LLM is planned but not yet available. For more, see the vLLM ModelOpt quantization guide.
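For concreteness, here is a minimal sketch of that serving step, assuming you already have a ModelOpt-quantized checkpoint exported in HuggingFace format (the directory path below is only a placeholder):

```python
from vllm import LLM, SamplingParams

# Placeholder path to a checkpoint quantized with NeMo/ModelOpt and exported
# in HuggingFace format.
llm = LLM(model="./my-model-fp8-modelopt", quantization="modelopt")

sampling = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```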

Would you like more detail on the quantization/export workflow or compatibility specifics?

If I were to [quote=“RunLLM, post:2, topic:637”]
quantize HuggingFace models using NeMo/ModelOpt
[/quote], where would that processing happen? On my very slow computer? Nvidia cloud? HF cloud? Somewhere-else-cloud? If cloud, can you say whether there’s a fee and how long NeMo/ModelOpt quantization takes?

Quantizing HuggingFace models with NeMo/ModelOpt is a local process: you run the quantization scripts on your own machine, not in the NVIDIA or HuggingFace cloud. The processing happens wherever you execute the code (e.g., your computer or a server you control). There is no built-in cloud service or fee for quantization—it’s open-source software you install and run yourself. The quantization time depends on your hardware and model size; no specific timing benchmarks are provided in the vLLM documentation. See the vLLM ModelOpt quantization guide for setup and usage.
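To make the “local process” point concrete, below is a rough sketch of the quantize-and-export step, loosely following NVIDIA’s ModelOpt post-training-quantization examples. The model ID, the toy calibration loop, and the export_hf_checkpoint helper name are assumptions based on recent ModelOpt releases, so double-check them against the ModelOpt documentation for your installed version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # helper name per recent ModelOpt examples

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # any HF causal LM you can download

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def calibrate(m):
    # Toy calibration loop; real runs feed a few hundred representative samples.
    batch = tokenizer("Hello, world!", return_tensors="pt").to(m.device)
    with torch.no_grad():
        m(**batch)

# FP8 post-training quantization runs locally on your GPU(s); other schemes
# (e.g. INT4 AWQ) are selected the same way via mtq.* config objects.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)

# Write out a HuggingFace-format checkpoint that vLLM can load with quantization="modelopt".
export_hf_checkpoint(model, export_dir="./my-model-fp8-modelopt")
```

Everything above runs wherever you launch Python, which is why the only cost and runtime involved are your own hardware’s.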

Would you like more detail on hardware requirements or example quantization scripts?

So “Day 0” kind of really means I could download a new model from HF and mash “go” to start my local quantizing process, starting the number-of-days counter that runs until I would be able to use the model, assuming I have chosen well with my quantization characteristics & properties for my local inference server? And then maybe 60 to 90+ days later, when local quantization has finished, assuming it’s done right on my first try, I could serve the new “Day 0” model locally via vLLM?

“Day 0” support means you can immediately download a new model from HuggingFace and start the quantization process locally using NeMo/ModelOpt, provided you have the necessary hardware and software. The actual quantization time is typically hours (not 60–90 days), but it depends on your hardware and model size. Once quantization is complete and successful, you can serve the quantized model with vLLM right away—there’s no enforced waiting period beyond your quantization runtime. See the vLLM ModelOpt quantization guide for workflow details.

Would you like more detail on expected quantization times or hardware recommendations?

Hey Chris. vLLM is designed to be extremely compatible with the OSS ecosystem. In particular we use the HF model format and work closely with model providers to ensure their models can run on vLLM on the day of release.

Since much of the ecosystem uses HF model format, this provides a unified export format that enables compatibility with various frameworks.

In the case of quantization, for instance, vLLM’s “llm-compressor” library uses the HF model format and saves to a format compatible with vLLM (and other engines). As a result, you can typically use, quantize, and optimize models on the day of release, or in the worst case a few days after (for major models). Some quantization experiments take a few hours to run (e.g. GPTQ), and some are very quick (e.g. FP8 quantization, which uses no calibration data).

We also create many models that are pre-quantized and ready to deploy: RedHatAI (Red Hat AI). We usually turn around major model releases in a week or two. In this repo we also provide reproduction instructions for using llm-compressor to create your own quantized models.

Try it out: GitHub - vllm-project/llm-compressor: Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
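For a rough feel for the quick FP8 path, here is a sketch adapted from the llm-compressor README’s FP8-dynamic example. Import paths and save options have shifted between releases (older versions import oneshot from llmcompressor.transformers), so treat it as approximate and check the repo’s docs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small model, purely for illustration

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization needs no calibration data, which is why it runs quickly.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in HF format; vLLM (and other engines) can load the directory directly.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```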

Yes, vLLM is highly compatible with the HuggingFace (HF) model format, enabling immediate or near-immediate support for new model releases. You can quantize models using the llm-compressor library, which saves models in a format compatible with vLLM and other engines. Quantization times vary: FP8 quantization is very fast (no calibration data needed), while methods like GPTQ may take a few hours. Pre-quantized models are also available for quick deployment at RedHatAI on HuggingFace.
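As a minimal sketch of the pre-quantized route: loading one of those checkpoints is an ordinary vLLM model load, since the quantization scheme is read from the checkpoint’s config. The repo ID below is illustrative only; browse the RedHatAI organization on HuggingFace for current names:

```python
from vllm import LLM, SamplingParams

# Illustrative repo ID; check the RedHatAI org on HuggingFace for real ones.
llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic")

out = llm.generate(["What does FP8-dynamic mean?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```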

Would you like more detail on quantization workflows or how to use pre-quantized models?

In terms of using NeMo and ModelOpt: those tools are developed by NVIDIA and are strong open-source projects. The vLLM team does not develop these tools, so you are more likely to find help by asking in their communities.

Correct—NeMo and ModelOpt are developed and maintained by NVIDIA, not the vLLM team. For questions or support regarding NeMo or ModelOpt, you should consult NVIDIA’s official documentation or community forums, as vLLM only provides integration points for these tools. See the NVIDIA TensorRT Model Optimizer documentation for how vLLM interacts with ModelOpt.

Would you like more detail on how vLLM integrates with these NVIDIA tools?

@robertshaw Thanks! I pulled a couple of the RedHatAI models after, I think, they were mentioned in an office hours session a couple/few episodes back. I’ll look at vLLM’s llm-compressor too. Thank you!

Right, NeMo & ModelOpt are from Nvidia, but they closed their announcement with “export to vLLM for optimized inference, with NVIDIA TensorRT-LLM export coming soon,” so I wondered what vLLM (figuring on just an answer from RunBot, so thank you again) would have to say.

Since both tools do quantization, vLLM’s llm-compressor and this AutoModel in Nvidia’s NeMo sound as if they’re tools for the ~same purpose. Flirty cousins, if you will. @RunBot would you say AutoModel and llm-compressor handle basically the same task? If a person is running Nvidia GPUs and vLLM, is either tool preferable for quantization in that scenario? Or would a preference between tools likely hinge on the details of the LM and/or the hardware & GPU(s), and is that too much to get into as an answer for this 35,000 ft question?

Yes, NVIDIA NeMo/ModelOpt (AutoModel) and vLLM’s llm-compressor both perform model quantization for efficient inference, and both can export models in HuggingFace format compatible with vLLM. They are indeed similar tools—“flirty cousins”—with overlapping goals.

If you are running NVIDIA GPUs and vLLM, either tool can be used for quantization. The choice often depends on your specific model, quantization scheme (e.g., INT4, FP8), and workflow preferences. ModelOpt may offer tighter integration with NVIDIA’s hardware and future TensorRT-LLM export, while llm-compressor is maintained by the vLLM team and is designed for seamless vLLM compatibility and rapid support for new models. For most users, both are viable; preference may hinge on hardware, quantization type, or community support. See the vLLM ModelOpt guide and llm-compressor repo for details.

Would you like a step-by-step comparison or recommendations for specific quantization types?
