Llama 3.3 70B very slow

GuillaumeGTheodo · December 11, 2025, 3:45pm

Hello David,

I would try running a quantized version of Llama-3.3-70B-Instruct (see nvidia/Llama-3.3-70B-Instruct-FP8 or RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8). That should reduce the memory footprint by roughly 2x to 3x compared to the base, unquantized model provided by Meta (which stands at ~140GB).
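To make the ~140GB figure concrete, here is a rough back-of-the-envelope sketch of weight memory at different precisions. It assumes a nominal 70e9 parameters (taken from the model name), counts weights only (no KV cache, activations, or runtime overhead), and the helper `weight_memory_gb` is a hypothetical name for illustration. Note that 8-bit weights alone give about 2x; a larger reduction would need a lower-bit format.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count taken from the "70B" in the model name.
NUM_PARAMS = 70e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized 16-bit weights
fp8_gb = weight_memory_gb(NUM_PARAMS, 8)    # 8-bit weights (FP8 / w8a8)

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"8-bit weights:  ~{fp8_gb:.0f} GB")
print(f"reduction:      ~{bf16_gb / fp8_gb:.1f}x")
```

The memory freed by quantization can then go to the KV cache, which is usually what allows larger batches.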

Unlike gpt-oss-120b, which is an already aggressively quantized (MXFP4) MoE model with fewer weights active per token, Llama is a dense model. In my experience it should have lower throughput, but not by that much.

You can also look into speculative decoding to improve inter-token latency (ITL), see:

• yuhuili/EAGLE3-LLaMA3.3-Instruct-70B

In my experience, you can set max-num-batched-tokens much higher as you add more H200s.
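As a starting point, the suggestions above could be combined into a single launch command. The sketch below only builds and prints the command, it does not start a server. The flag names (`--tensor-parallel-size`, `--max-num-batched-tokens`, `--speculative-config`) and the `eagle3` method key are what I understand recent vLLM versions to accept, so please check them against `vllm serve --help` for your version. The numeric values are illustrative placeholders, not tuned recommendations, and `build_vllm_serve_command` is a hypothetical helper.

```python
import json
import shlex


def build_vllm_serve_command(
    model: str,
    draft_model: str,
    tensor_parallel_size: int,
    max_num_batched_tokens: int,
    num_speculative_tokens: int,
) -> list[str]:
    """Assemble a `vllm serve` argument list (assumed flag names, see note above)."""
    # Assumption: speculative decoding is configured via a JSON object.
    speculative_config = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--speculative-config", json.dumps(speculative_config),
    ]


cmd = build_vllm_serve_command(
    model="RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8",
    draft_model="yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    tensor_parallel_size=2,          # placeholder: number of GPUs to shard across
    max_num_batched_tokens=16384,    # placeholder: raise as you add GPUs
    num_speculative_tokens=3,        # placeholder: draft tokens per step
)

# Print a shell-safe command line for copy/paste.
print(shlex.join(cmd))
```

It is worth benchmarking each change separately (quantization, then speculative decoding, then batch size) so you can see which one actually helps your workload.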

There are a lot of things you can tweak to improve performance, but that should be a good starting point.
