I have been testing Llama 3.3 70B on 2 H200 GPUs and found it to be very slow: I get around 40 tokens/sec, compared to roughly 200 tokens/sec for gpt-oss-120B or 120 tokens/sec for GLM-4.5-Air FP8.
I am wondering if this number is realistic. Is the difference only due to the number of active parameters? If someone has experience with these models, could you tell me whether this throughput seems OK or low compared to your setup?
40 tokens/sec for Llama 3.3 70B on 2 H200 GPUs is lower than expected; for reference, vLLM benchmarks on 8xA100s show Llama 3 70B at ~700 tokens/sec, and Llama 3.1 405B at ~3100 tokens/sec, though hardware and batch size affect results. Throughput differences are not only due to parameter count—model architecture, quantization, attention implementation, and parallelism strategy also play major roles. H200s should outperform A100s, so your result suggests a possible suboptimal configuration or bottleneck (nightly benchmarks, Slack discussion).
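As a sanity check on the batch-size point, here is a back-of-envelope memory-bandwidth roofline (my own sketch, assuming BF16 weights, ~4.8 TB/s HBM bandwidth per H200, and a single in-flight request). It shows why single-stream decode of a dense 70B model tops out around ~70 tok/s, while batched serving can reach the benchmark figures above:

```python
# Rough decode roofline. Assumptions: BF16 weights, one request in flight,
# ~4.8 TB/s HBM3e bandwidth per H200, weights split evenly via tensor parallelism.
# Decode is memory-bandwidth bound: each generated token re-reads all weights.

params = 70e9              # Llama 3.3 70B parameter count
bytes_per_param = 2        # BF16
num_gpus = 2
hbm_bw_per_gpu = 4.8e12    # bytes/sec, approximate H200 bandwidth

weight_bytes_per_gpu = params * bytes_per_param / num_gpus
max_single_stream_tps = hbm_bw_per_gpu / weight_bytes_per_gpu
print(f"~{max_single_stream_tps:.0f} tok/s upper bound for a single request")
# ~69 tok/s -- so ~40 tok/s is plausible for batch size 1, while the ~700 tok/s
# benchmark figure comes from serving many requests concurrently.
```

If your 40 tok/s was measured with a single request, it is within a factor of ~2 of that bound; the gap to the benchmark numbers then mostly reflects batching rather than a broken setup.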
To maximize throughput, check your serving configuration: set the tensor parallel size to match your GPU count, increase max_num_batched_tokens (e.g., 8192 or higher), and tune the batch size / request concurrency you benchmark with. Also consider quantization (an FP8 checkpoint halves weight traffic) and enable chunked prefill if available; a minimal launch sketch is below. With correct tuning, Llama 3.3 70B should comfortably exceed 40 tokens/sec on 2 H200s (performance tuning guide, optimization docs). Would you like a step-by-step checklist to diagnose and improve your setup?
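A minimal vLLM configuration sketch along those lines; the checkpoint name, memory fraction, and batch limits are illustrative assumptions to adjust for your setup:

```python
# Minimal vLLM offline-serving sketch (values shown are assumptions, not tuned).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # or an FP8 variant if available
    tensor_parallel_size=2,            # split the 70B weights across both H200s
    gpu_memory_utilization=0.90,       # leave headroom for activations / KV cache
    max_num_batched_tokens=8192,       # larger batches improve throughput
    enable_chunked_prefill=True,       # overlap prefill with ongoing decode
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

For throughput numbers comparable to the benchmarks above, pass many prompts to a single generate() call (or use vLLM's benchmarking scripts) rather than timing one request at a time.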