Llama 3.3 70B very slow

Hi,

I have been testing Llama 3.3 70B on 2 H200 GPUs, and I found it to be very slow. I get around 40 tokens per second, compared to gpt-oss-120B (~200 tokens/s) or GLM-4.5-Air FP8 (~120 tokens/s).

I am wondering if this number is realistic. Is the difference only due to the number of active parameters? If someone has experience with these models, could you tell me whether this throughput seems OK or low compared to your setup?

Thanks!


40 tokens/sec for Llama 3.3 70B on 2 H200 GPUs is lower than expected; for reference, vLLM benchmarks on 8xA100s show Llama 3 70B at ~700 tokens/sec, and Llama 3.1 405B at ~3100 tokens/sec, though hardware and batch size affect results. Throughput differences are not only due to parameter count—model architecture, quantization, attention implementation, and parallelism strategy also play major roles. H200s should outperform A100s, so your result suggests a possible suboptimal configuration or bottleneck (nightly benchmarks, Slack discussion).
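To make the active-parameter point concrete, here is a rough, assumed comparison of how many bytes of weights each model has to read per decoded token (the active parameter counts and precisions below are approximations, not measured values):

```python
# Single-stream decode is largely memory-bandwidth bound: each generated token
# requires reading all *active* weights once. Rough bytes-per-token comparison;
# parameter counts and dtypes are approximate assumptions.
models = {
    "Llama-3.3-70B (dense, BF16)":              (70e9,  2.0),  # all 70B params active
    "gpt-oss-120B (MoE, ~5B active, ~4-bit)":   (5.1e9, 0.5),
    "GLM-4.5-Air (MoE, ~12B active, FP8)":      (12e9,  1.0),
}

for name, (active_params, bytes_per_param) in models.items():
    gb_per_token = active_params * bytes_per_param / 1e9
    print(f"{name:42s} ~{gb_per_token:6.1f} GB read per token")
```

Under these assumptions the dense BF16 70B model reads roughly 140 GB of weights per token, versus only a few GB for the MoE models, which is the main reason its single-stream decode rate is so much lower.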

To maximize throughput, ensure you use optimal settings: set tensor parallel size to fit your GPUs, increase max_num_batched_tokens (e.g., 8192 or higher), and tune batch size. Also, check for quantization (FP8 models are faster), and use chunked prefill if available. Llama 3.3 70B should achieve much higher throughput than 40 tokens/sec on H200s with correct tuning (performance tuning guide, optimization docs). Would you like a step-by-step checklist to diagnose and improve your setup?
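For reference, a minimal sketch of those knobs using vLLM's offline Python API (argument names follow vLLM's EngineArgs; verify them against your installed version, and treat the FP8 options as optional assumptions):

```python
# Sketch of the tuning knobs mentioned above via vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,          # split the 70B weights across both H200s
    gpu_memory_utilization=0.95,
    max_model_len=16384,             # smaller context -> more room for KV cache
    max_num_batched_tokens=8192,     # larger prefill batches
    max_num_seqs=64,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # quantization="fp8",            # optional: on-the-fly FP8 weights on Hopper
    # kv_cache_dtype="fp8",          # optional: FP8 KV cache
)

outputs = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```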


Sorry, I forgot to mention that I am talking about a single-user scenario. Also, Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding | NVIDIA Technical Blog seems to suggest that 40-50 tokens/s is indeed about the limit. But please correct me if I’m wrong.
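A back-of-the-envelope check of that limit, assuming BF16 weights, ~4.8 TB/s HBM3e bandwidth per H200, and perfect overlap of the two GPUs (real numbers will be lower):

```python
# Single-user decode ceiling: every token requires reading all 140 GB of BF16
# weights. Ignores KV-cache reads, TP communication, and kernel overhead.
weights_gb = 70e9 * 2 / 1e9          # 70B params in BF16 ~= 140 GB
hbm_bw_tb_s = 4.8                    # ~4.8 TB/s HBM3e per H200 (spec-sheet figure)
num_gpus = 2                         # tensor parallelism splits the weight reads

ceiling = num_gpus * hbm_bw_tb_s * 1e3 / weights_gb   # tokens/s
print(f"Theoretical ceiling: ~{ceiling:.0f} tokens/s")  # ~69 tokens/s
```

By that estimate, 40-50 tokens/s at batch size 1 is in the expected range for plain BF16 decoding; speculative decoding helps precisely because it amortizes each full weight read over several candidate tokens.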

Can you share the config / vLLM parameters you’re using?

Of course.

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-prefix-caching \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64 \
  --max-seq-len-to-capture 8192 \
  --host localhost

Even a single H200 lacks the memory needed to serve the model, so it had to be run across two.
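Rough capacity math behind that, using the Llama 3 70B architecture numbers (80 layers, 8 KV heads, head_dim 128) as approximate assumptions:

```python
# Why one 141 GB H200 is not enough: BF16 weights alone nearly fill it, and a
# 128k-token context needs a large KV cache on top.
weights_gb = 70e9 * 2 / 1e9                      # ~140 GB of BF16 weights
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2        # K+V * layers * kv_heads * head_dim * 2 bytes
kv_gb_128k = kv_bytes_per_token * 128_000 / 1e9  # ~42 GB for one full-length sequence

print(f"Weights ~{weights_gb:.0f} GB + KV cache @128k ~{kv_gb_128k:.0f} GB "
      f"= ~{weights_gb + kv_gb_128k:.0f} GB vs 141 GB on a single H200")
```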