Two different runs give different answers

I am running the same model with vLLM and getting very different accuracies across runs, e.g. 0.823 and 0.236. Both runs are on an A100.
Init code:
import torch
from vllm import LLM, SamplingParams

model_max_length = 2048
llm = LLM(model=model, tensor_parallel_size=torch.cuda.device_count(), dtype=args.dtype, trust_remote_code=True)
tokenizer = llm.get_tokenizer()
sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=model_max_length, stop=[tokenizer.eos_token])
Generation code:
outputs = llm.generate(question, sampling_params)

vLLM does not guarantee reproducible results by default, even with the same model, prompt, and parameters. Variations can occur due to non-deterministic Torch operations, batching differences, hardware, or precision settings. For reproducibility, you must (1) set a fixed seed, (2) disable multiprocessing with os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0", and (3) ensure you run on the same hardware and vLLM version. Even then, exact reproducibility is only possible in offline (not server) mode and may still be affected by hardware or library differences. See the official reproducibility guide for details and a code example: Reproducibility — vLLM.
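
For reference, here is a minimal offline sketch of those settings; the model name, seed value, and prompt are placeholders, so substitute your own checkpoint and evaluation inputs:

import os

# Per the reproducibility guide: disable V1 multiprocessing before the engine is created.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

# seed fixes vLLM's internal RNG; any value works as long as it is identical across runs.
llm = LLM(model="facebook/opt-125m", seed=0)
sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=128)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)

Run this twice on the same machine with the same vLLM version and the generated text should be identical; in server mode, batching with other in-flight requests can still change results.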

If you do not set these options, vLLM may produce different outputs (and thus different accuracies) across runs, even with temperature=0. This is expected behavior due to the reasons above. For more, see FAQ: Can the output of a prompt vary across runs in vLLM?.

Would you like a step-by-step guide to make your runs reproducible?

@RunLLM But this is a huge variation

Yes, a variation this large is possible in vLLM if reproducibility settings are not enforced. Even with temperature=0, differences in batching, hardware, or precision can cause the generated outputs, and therefore the measured accuracy, to change significantly between runs, especially when concurrent requests or varying batch sizes are involved. This is a known limitation; see the FAQ and the Reproducibility docs.
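
If you want to confirm whether the remaining variation comes from vLLM itself, one rough check (continuing from the sketch above; the prompts are placeholders) is to push the same batch through a single engine instance twice and diff the generations:

# Placeholder prompts; use the actual questions from your eval set.
prompts = ["Question A", "Question B", "Question C"]

run1 = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]
run2 = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]

mismatches = [i for i, (a, b) in enumerate(zip(run1, run2)) if a != b]
print(f"{len(mismatches)} of {len(prompts)} generations differ")

If the two passes match but your end-to-end accuracy still swings between 0.823 and 0.236, the difference is coming from something outside the generation call (for example, how the harness parses or scores the outputs) rather than from sampling.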

Would you like detailed steps to minimize or eliminate this variation?
