Two different runs give different answers

I am running the same model with vLLM and getting very different accuracies across runs, e.g. 0.823 and 0.236. Both runs are on an A100.
Init code:
import torch
from vllm import LLM, SamplingParams

model_max_length = 2048
llm = LLM(model=model, tensor_parallel_size=torch.cuda.device_count(), dtype=args.dtype, trust_remote_code=True)
tokenizer = llm.get_tokenizer()
sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=model_max_length, stop=[tokenizer.eos_token])
Generation code:
outputs = llm.generate(question, sampling_params)

vLLM does not guarantee reproducible results by default, even with the same model, prompt, and parameters. Variations can occur due to non-deterministic Torch operations, batching differences, hardware, or precision settings. For reproducibility, you must (1) set a fixed seed, (2) disable multiprocessing with os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0", and (3) ensure you run on the same hardware and vLLM version. Even then, exact reproducibility is only possible in offline (not server) mode and may still be affected by hardware or library differences. See the official reproducibility guide for details and a code example: Reproducibility — vLLM.
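
For reference, here is a minimal offline sketch of those settings; the model name, seed value, and prompt are placeholders, so substitute your own checkpoint and evaluation inputs:

import os

# Per the reproducibility guide: disable V1 multiprocessing before the engine is created.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

# seed fixes vLLM's internal RNG; any value works as long as it is identical across runs.
llm = LLM(model="facebook/opt-125m", seed=0)
sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=128)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)

Run this twice on the same machine with the same vLLM version and the generated text should be identical; in server mode, batching with other in-flight requests can still change results.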

If you do not set these options, vLLM may produce different outputs (and thus different accuracies) across runs, even with temperature=0. This is expected behavior due to the reasons above. For more, see FAQ: Can the output of a prompt vary across runs in vLLM?.

Would you like a step-by-step guide to make your runs reproducible?

@RunLLM But this is a huge variation

Yes, a variation this large is possible in vLLM if reproducibility settings are not enforced. Even with temperature=0, differences in batching, hardware, or precision can cause the generated outputs, and therefore the measured accuracy, to change significantly between runs, especially when concurrent requests or varying batch sizes are involved. This is a known limitation; see the FAQ and the Reproducibility docs.
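
If you want to confirm whether the remaining variation comes from vLLM itself, one rough check (continuing from the sketch above; the prompts are placeholders) is to push the same batch through a single engine instance twice and diff the generations:

# Placeholder prompts; use the actual questions from your eval set.
prompts = ["Question A", "Question B", "Question C"]

run1 = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]
run2 = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]

mismatches = [i for i, (a, b) in enumerate(zip(run1, run2)) if a != b]
print(f"{len(mismatches)} of {len(prompts)} generations differ")

If the two passes match but your end-to-end accuracy still swings between 0.823 and 0.236, the difference is coming from something outside the generation call (for example, how the harness parses or scores the outputs) rather than from sampling.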

Would you like detailed steps to minimize or eliminate this variation?
