Llama 3.3 70B very slow

RunLLM · October 10, 2025, 12:34pm

40 tokens/sec for Llama 3.3 70B on 2 H200 GPUs is lower than expected. For reference, vLLM benchmarks on 8xA100s show Llama 3 70B at ~700 tokens/sec and Llama 3.1 405B at ~3100 tokens/sec, though those figures depend on hardware and batch size and are not directly comparable to a 2-GPU setup. Throughput differences are not only due to parameter count: model architecture, quantization, attention implementation, and parallelism strategy also play major roles. H200s should outperform A100s, so your result suggests a suboptimal configuration or a bottleneck (nightly benchmarks, Slack discussion).
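Because batch size affects the numbers above, it helps to be explicit about which throughput is being reported. The sketch below is only an illustration: the function names and the sample figures are made up for this example and are not taken from the benchmarks or from your run. It separates aggregate tokens/sec across all concurrent requests from the average per-request rate.

```python
def aggregate_throughput(output_token_counts, wall_seconds):
    """Total generated tokens per second, summed over all requests."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(output_token_counts) / wall_seconds


def per_request_throughput(output_token_counts, wall_seconds):
    """Average tokens per second seen by a single request."""
    if not output_token_counts:
        raise ValueError("need at least one request")
    return aggregate_throughput(output_token_counts, wall_seconds) / len(output_token_counts)


# Hypothetical numbers, for illustration only.
single = [400]            # one request, 400 output tokens
batched = [400] * 32      # 32 concurrent requests, 400 output tokens each

print(aggregate_throughput(single, 10.0))     # 40.0
print(aggregate_throughput(batched, 20.0))    # 640.0
print(per_request_throughput(batched, 20.0))  # 20.0
```

If the 40 tokens/sec figure comes from a single request, it is a per-request decode rate and should not be compared directly with aggregate benchmark throughput.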

To maximize throughput, check the following settings: set the tensor parallel size to match your GPU count (2 in your case), increase max_num_batched_tokens (e.g., 8192 or higher), and tune the batch size, that is, how many sequences run concurrently. Also consider a quantized checkpoint (FP8 models are faster) and enable chunked prefill if your version supports it. With correct tuning, Llama 3.3 70B should achieve much higher throughput than 40 tokens/sec on H200s (performance tuning guide, optimization docs). Would you like a step-by-step checklist to diagnose and improve your setup?
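As a rough sketch of how those settings could be passed to the server, the helper below assembles a `vllm serve` command line. The helper name is hypothetical, the model ID is an assumption (the thread does not say which checkpoint is in use), and the flag names should be checked against `vllm serve --help` for your installed version.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size,
                             max_num_batched_tokens=8192,
                             enable_chunked_prefill=True):
    """Assemble a `vllm serve` argument list (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model,
        # One tensor-parallel shard per GPU.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Upper bound on tokens processed per scheduler step.
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]
    if enable_chunked_prefill:
        cmd.append("--enable-chunked-prefill")
    return cmd


# Assumed checkpoint name; substitute the model you actually serve.
command = build_vllm_serve_command("meta-llama/Llama-3.3-70B-Instruct", 2)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` directly. Change one setting at a time and re-measure so you can tell which change helped.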

