40 tokens/sec for Llama 3.3 70B on 2 H200 GPUs is lower than expected. For reference, vLLM benchmarks on 8xA100s show Llama 3 70B at ~700 tokens/sec and Llama 3.1 405B at ~3100 tokens/sec, though hardware and batch size affect results. Throughput differences are not due to parameter count alone: model architecture, quantization, attention implementation, and parallelism strategy also play major roles. H200s should outperform A100s, so your result suggests a suboptimal configuration or a bottleneck (nightly benchmarks, Slack discussion).
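To put the figures above on a comparable footing, here is a rough per-GPU normalization. This is only a back-of-the-envelope sketch: dividing by GPU count is my own simplification (it ignores batch size, GPU generation, and parallelism overhead), and the helper name `per_gpu` is hypothetical.

```python
def per_gpu(tokens_per_sec: float, num_gpus: int) -> float:
    """Naive per-GPU throughput: total tokens/sec divided by GPU count."""
    return tokens_per_sec / num_gpus


# Figures quoted above; the reference numbers are approximate (~).
results = {
    "Llama 3.3 70B, 2x H200 (your run)": (40, 2),
    "Llama 3 70B, 8x A100 (reference)": (700, 8),
    "Llama 3.1 405B, 8x A100 (reference)": (3100, 8),
}

for name, (tokens_per_sec, num_gpus) in results.items():
    print(f"{name}: {per_gpu(tokens_per_sec, num_gpus):.1f} tokens/sec per GPU")
```

Even on this crude measure, the 2x H200 run sits well below the 70B reference on older hardware, which is consistent with a configuration problem rather than a hardware limit.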
To maximize throughput, start with these settings: set the tensor parallel size to match your GPU count (2 here), increase max_num_batched_tokens (e.g., 8192 or higher), and tune the batch size. Also check whether you are running a quantized model (FP8 models are faster), and enable chunked prefill if available. With correct tuning, Llama 3.3 70B should achieve much higher throughput than 40 tokens/sec on H200s (performance tuning guide, optimization docs). Would you like a step-by-step checklist to diagnose and improve your setup?
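As a starting point, the settings above could be assembled into a `vllm serve` command along these lines. This is a minimal sketch, not a tuned configuration: the helper `build_vllm_serve_command` is hypothetical, the model ID is an assumption about which checkpoint you are serving, and the flag names should be checked against `vllm serve --help` for your installed version.

```python
import shlex


def build_vllm_serve_command(
    model: str,
    num_gpus: int,
    max_num_batched_tokens: int = 8192,
    chunked_prefill: bool = True,
    fp8: bool = False,
) -> list[str]:
    """Build an argument list for `vllm serve` (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model,
        # One tensor-parallel shard per GPU.
        "--tensor-parallel-size", str(num_gpus),
        # Larger token budget per scheduler step; 8192 is only a starting value.
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]
    if chunked_prefill:
        cmd.append("--enable-chunked-prefill")
    if fp8:
        # Assumes your GPUs and vLLM build support FP8 quantization.
        cmd += ["--quantization", "fp8"]
    return cmd


# Assumed model ID; replace with the checkpoint you actually serve.
command = build_vllm_serve_command("meta-llama/Llama-3.3-70B-Instruct", num_gpus=2)
print(shlex.join(command))
```

Printing the command rather than launching it lets you review the flags first. Change one setting at a time and re-run your benchmark so you can tell which change actually moved throughput.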