Llama 3.3 70B very slow

Hi,

I have been testing Llama 3.3 70B on 2 H200 GPUs, and I found it to be very slow. I get around 40 tokens per second, compared to gpt-oss-120B (~200 tokens/s) or GLM-4.5-Air FP8 (~120 tokens/s).

I am wondering if this number is realistic. Is the difference only due to the number of active parameters? If someone has experience with these models, could you tell me whether this throughput seems OK or low compared to your setup?

Thanks!


40 tokens/sec for Llama 3.3 70B on 2 H200 GPUs is lower than expected; for reference, vLLM benchmarks on 8xA100s show Llama 3 70B at ~700 tokens/sec, and Llama 3.1 405B at ~3100 tokens/sec, though hardware and batch size affect results. Throughput differences are not only due to parameter count—model architecture, quantization, attention implementation, and parallelism strategy also play major roles. H200s should outperform A100s, so your result suggests a possible suboptimal configuration or bottleneck (nightly benchmarks, Slack discussion).
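To make the active-parameter point concrete, here is a rough, assumed comparison of how many bytes of weights each model has to read per decoded token (the active parameter counts and precisions below are approximations, not measured values):

```python
# Single-stream decode is largely memory-bandwidth bound: each generated token
# requires reading all *active* weights once. Rough bytes-per-token comparison;
# parameter counts and dtypes are approximate assumptions.
models = {
    "Llama-3.3-70B (dense, BF16)":              (70e9,  2.0),  # all 70B params active
    "gpt-oss-120B (MoE, ~5B active, ~4-bit)":   (5.1e9, 0.5),
    "GLM-4.5-Air (MoE, ~12B active, FP8)":      (12e9,  1.0),
}

for name, (active_params, bytes_per_param) in models.items():
    gb_per_token = active_params * bytes_per_param / 1e9
    print(f"{name:42s} ~{gb_per_token:6.1f} GB read per token")
```

Under these assumptions the dense BF16 70B model reads roughly 140 GB of weights per token, versus only a few GB for the MoE models, which is the main reason its single-stream decode rate is so much lower.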

To maximize throughput, ensure you use optimal settings: set tensor parallel size to fit your GPUs, increase max_num_batched_tokens (e.g., 8192 or higher), and tune batch size. Also, check for quantization (FP8 models are faster), and use chunked prefill if available. Llama 3.3 70B should achieve much higher throughput than 40 tokens/sec on H200s with correct tuning (performance tuning guide, optimization docs). Would you like a step-by-step checklist to diagnose and improve your setup?
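For reference, a minimal sketch of those knobs using vLLM's offline Python API (argument names follow vLLM's EngineArgs; verify them against your installed version, and treat the FP8 options as optional assumptions):

```python
# Sketch of the tuning knobs mentioned above via vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,          # split the 70B weights across both H200s
    gpu_memory_utilization=0.95,
    max_model_len=16384,             # smaller context -> more room for KV cache
    max_num_batched_tokens=8192,     # larger prefill batches
    max_num_seqs=64,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # quantization="fp8",            # optional: on-the-fly FP8 weights on Hopper
    # kv_cache_dtype="fp8",          # optional: FP8 KV cache
)

outputs = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```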


Sorry, I forgot to mention that I am talking about a single-user scenario. Also, Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding | NVIDIA Technical Blog seems to suggest that 40-50 tokens/s is indeed about the limit. But please correct me if I’m wrong.
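A back-of-the-envelope check of that limit, assuming BF16 weights, ~4.8 TB/s HBM3e bandwidth per H200, and perfect overlap of the two GPUs (real numbers will be lower):

```python
# Single-user decode ceiling: every token requires reading all 140 GB of BF16
# weights. Ignores KV-cache reads, TP communication, and kernel overhead.
weights_gb = 70e9 * 2 / 1e9          # 70B params in BF16 ~= 140 GB
hbm_bw_tb_s = 4.8                    # ~4.8 TB/s HBM3e per H200 (spec-sheet figure)
num_gpus = 2                         # tensor parallelism splits the weight reads

ceiling = num_gpus * hbm_bw_tb_s * 1e3 / weights_gb   # tokens/s
print(f"Theoretical ceiling: ~{ceiling:.0f} tokens/s")  # ~69 tokens/s
```

By that estimate, 40-50 tokens/s at batch size 1 is in the expected range for plain BF16 decoding; speculative decoding helps precisely because it amortizes each full weight read over several candidate tokens.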

Can you share the config / vLLM parameters you’re using?

Of course.

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-prefix-caching \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64 \
  --max-seq-len-to-capture 8192 \
  --host localhost

Even a single H200 lacks the memory needed to serve the model, so it had to be run across two.
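Rough capacity math behind that, using the Llama 3 70B architecture numbers (80 layers, 8 KV heads, head_dim 128) as approximate assumptions:

```python
# Why one 141 GB H200 is not enough: BF16 weights alone nearly fill it, and a
# 128k-token context needs a large KV cache on top.
weights_gb = 70e9 * 2 / 1e9                      # ~140 GB of BF16 weights
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2        # K+V * layers * kv_heads * head_dim * 2 bytes
kv_gb_128k = kv_bytes_per_token * 128_000 / 1e9  # ~42 GB for one full-length sequence

print(f"Weights ~{weights_gb:.0f} GB + KV cache @128k ~{kv_gb_128k:.0f} GB "
      f"= ~{weights_gb + kv_gb_128k:.0f} GB vs 141 GB on a single H200")
```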