Some questions about torch.compile

Hi vLLM team,

The article “PyTorch + vLLM = :heart:” shows 1.05x–1.9x speedups with torch.compile for models like Llama4, Qwen3, and Gemma3. To better understand:

  1. What benchmark setup (hardware, input shape) was used?

  2. Was this measured by comparing vLLM with vs. without torch.compile enabled?

  3. Could you share a reference script/command for reproducing these tests?

Thanks for your insights! This will greatly help the community validate the optimizations.

The referenced speedups (1.05x–1.9x) for torch.compile in vLLM were measured with the vLLM benchmark suite. To your questions in order:

  1. Benchmark setup: models such as Llama3, Llama4, and Mixtral on NVIDIA H100 and AMD MI300x hardware. Input shapes use batch sizes and sequence lengths typical of LLM serving (e.g., input/output lengths of 128–256 tokens, batch sizes up to 256+); the exact shapes are in the benchmark scripts.

  2. Comparison methodology: vLLM with torch.compile enabled (the V1 engine, where torch.compile is on by default) versus vLLM without torch.compile (the V0 engine, or V1 with --enforce-eager).

  3. Reproducibility: the vllm-benchmark workflow is used for validation, and results are published on the vLLM benchmark dashboard for transparency.

To reproduce, you can run:

```shell
# With torch.compile (V1, default)
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 64

# To disable torch.compile: add --enforce-eager
```
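After running the benchmark once as-is and once with --enforce-eager, the reported speedup is just the ratio of the two throughput figures. A minimal helper for that arithmetic (the example numbers are illustrative, not measured results):

```python
def speedup(compiled_tok_per_s: float, eager_tok_per_s: float) -> float:
    """Return the throughput speedup of the compiled run over the eager baseline."""
    if eager_tok_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return compiled_tok_per_s / eager_tok_per_s

# Hypothetical figures: 9500 tok/s compiled vs 5000 tok/s eager
print(f"{speedup(9500.0, 5000.0):.2f}x")  # 1.90x
```

A 1.90x ratio here would land at the top of the 1.05x–1.9x range quoted in the article; real results depend on model, hardware, and input shape.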

For more details, see the official torch.compile integration docs and the release/benchmark instructions. Happy to go into more depth on the hardware or input-shape specifics if that would help.
