Hello David,
I would try running a quantized version of Llama-3.3-70B-Instruct (see nvidia/Llama-3.3-70B-Instruct-FP8 or RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8). These are 8-bit checkpoints, so they should reduce the weight memory footprint by roughly 2x compared to the base, unquantized model provided by Meta (which stands at ~140 GB).
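For reference, the arithmetic behind those numbers can be sketched as below. This is a back-of-the-envelope estimate for the weights only; it assumes a round 70B parameter count (taken from the model name) and ignores KV cache, activations, and any layers a given checkpoint leaves unquantized.

```python
# Rough weight-memory estimate; KV cache and activations are NOT included.
PARAMS = 70e9  # assumed parameter count, rounded from the model name

# Bytes stored per weight for each format (8-bit formats use 1 byte).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8 / w8a8": 1.0}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight size in GB, using 1 GB = 1e9 bytes."""
    return params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(PARAMS, nbytes):.0f} GB")
```

Whatever is left on the GPUs after the weights is what you get for KV cache, which is why the smaller checkpoint also helps with batch size.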
Unlike gpt-oss-120b, which is an already aggressively quantized (MXFP4) MoE model that activates only a fraction of its weights for each token, Llama is a dense model, so in my experience it should have lower throughput, but not by that much.
You can also look into speculative decoding to improve inter-token latency (ITL), see:
• yuhuili/EAGLE3-LLaMA3.3-Instruct-70B
In my experience, the more H200s you add, the higher you can set max-num-batched-tokens.
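Putting the pieces together, here is a minimal sketch of a launch command. It assumes you are serving with vLLM (where max-num-batched-tokens is an engine option) and a recent version that accepts `--speculative-config` as a JSON string. The helper `build_serve_command` is hypothetical, and the GPU count, token budget, and `num_speculative_tokens` are placeholder values to tune, not recommendations; please check the flags against the documentation for your vLLM version.

```python
import json
import shlex


def build_serve_command(num_gpus: int, max_batched_tokens: int) -> list[str]:
    """Assemble a `vllm serve` command line (hypothetical helper)."""
    # EAGLE3 draft model from the post; num_speculative_tokens is a placeholder.
    spec = {
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
        "num_speculative_tokens": 3,
    }
    return [
        "vllm", "serve", "nvidia/Llama-3.3-70B-Instruct-FP8",
        "--tensor-parallel-size", str(num_gpus),
        "--max-num-batched-tokens", str(max_batched_tokens),
        "--speculative-config", json.dumps(spec),
    ]


# Example values only: 2 GPUs and a 16384-token batch budget.
print(shlex.join(build_serve_command(num_gpus=2, max_batched_tokens=16384)))
```

The printed line can be pasted into a shell; raise the token budget as you add GPUs and watch throughput and latency to find where it stops helping.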
There are a lot of things you can tweak to improve performance, but that should be a good starting point.