Transformers `do_sample=False` vs `SamplingParams` `temperature=0` gives different results

Hi, guys. I am not sure whether to post this as a GitHub issue or here. The category name says benchmarking, and I think this question fits.

I found that running the same model with the same precision and the same generation kwargs (greedy decoding, temperature=0) through transformers' `.generate()` and vLLM's `LLM.generate()` gives different results.

Is that a known issue?

I will leave the code for reproducibility:

import torch
import os
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["VLLM_USE_V1"] = "0"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
torch.use_deterministic_algorithms(True)

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    trust_remote_code=True,
)

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tokenizer="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="float32",
    gpu_memory_utilization=0.8,
    enable_chunked_prefill=False,
    enforce_eager=True,
)

sampling_params = SamplingParams(
    temperature=1e-8,
    max_tokens=15,
    top_p=1.0,
    top_k=-1,
    seed=42,
    stop_token_ids=[tokenizer.eos_token_id],
    min_p=0.0,
    repetition_penalty=1.0,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    logprobs=10,
)

prompt = "How are you?"
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    chat_template=getattr(tokenizer, "chat_template", None),
)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
### As an AI language model, I don't have feelings, but I'm

# And now the same for transformers
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype="float32",
    device_map="auto",
    trust_remote_code=True,
)
prompt = "How are you?"
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    chat_template=getattr(tokenizer, "chat_template", None),
)
inputs = tokenizer(
    [text], padding=True, truncation=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
)
input_lengths = [len(ids) for ids in inputs["input_ids"]]
generated_texts = []
for output, input_len in zip(outputs, input_lengths):
    generated_part = output[input_len:]  # tokens generated after the prompt
    decoded = tokenizer.decode(generated_part, skip_special_tokens=True).strip()
    generated_texts.append(decoded)
print(generated_texts[0])
### "As an AI language model, I don't have feelings or emotions like humans"

As you can see in the code, the answers are "similar" but not identical. I think reproducibility is very important for benchmarking.

Yes, this is a known and expected issue: vLLM and Hugging Face Transformers can produce different outputs even with the same model, precision, and deterministic settings (e.g., temperature=0). The causes are differences in kernel implementations, batching, and numerical precision. Even with all seeds and deterministic flags set, small floating-point differences or batching order can cause divergence, especially across different backends or hardware. This is documented in the vLLM FAQ and multiple GitHub issues: strict output matching is not guaranteed, only best-effort consistency. For benchmarking, vLLM recommends using float32 and matching all sampling parameters, but even then, minor differences may persist due to these factors. See the vLLM FAQ and Model Support Policy for details.
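To make the floating-point point concrete, here is a minimal, framework-free sketch: floating-point addition is not associative, so two kernels that reduce the same values in a different order can legitimately produce different logits.

```python
# Floating-point addition is not associative: the grouping (reduction
# order) changes the rounded result, which is exactly what different
# kernel implementations do when summing the same values.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

print(left == right)  # False
```

The same effect, scaled up across thousands of fused multiply-adds per logit, is why two mathematically equivalent implementations need not agree bit-for-bit.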

If you need bit-for-bit reproducibility, it is only possible under very specific conditions (identical hardware, software versions, and execution order), and even then, vLLM’s optimizations (like dynamic batching) can introduce nondeterminism. For benchmarking, always document your environment and settings, and expect small differences between vLLM and Transformers outputs, even for greedy decoding. Would you like a step-by-step breakdown of the technical reasons for these differences?
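To illustrate why even greedy decoding diverges, here is a toy sketch with invented logit values: when two candidate tokens are nearly tied, rounding noise on the order of a float32 ulp is enough to flip the argmax, and from that token onward the two generations follow different trajectories.

```python
# Hypothetical logits for 3 candidate tokens, from two "runs" that
# differ only by rounding noise in the last decimal place. Greedy
# decoding takes the argmax, so each run picks a different token.
logits_run_a = [10.00001, 10.00000, 3.0]
logits_run_b = [10.00000, 10.00001, 3.0]

pick_a = logits_run_a.index(max(logits_run_a))
pick_b = logits_run_b.index(max(logits_run_b))

print(pick_a, pick_b)  # 0 1
```

Once the picked token differs, it is fed back as context, so the continuations diverge further, which matches the "similar but not identical" outputs above.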
