Transformers `do_sample=False` vs `SamplingParams` `temperature=0` gives different results

Hi, guys. I am not sure whether to post this as a GitHub issue or here. The category name says benchmarking, and I think this question fits.

I found that running the same model with the same precision and the same generation kwargs (greedy decoding, temperature=0) through transformers' `.generate()` and vLLM's `LLM.generate()` gives different results.

Is that a known issue?

I will leave the code for reproducibility:

import torch
import os
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["VLLM_USE_V1"] = "0"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
torch.use_deterministic_algorithms(True)

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    trust_remote_code=True,
)

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tokenizer="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="float32",
    gpu_memory_utilization=0.8,
    enable_chunked_prefill=False,
    enforce_eager=True,
)

sampling_params = SamplingParams(
    temperature=1e-8,
    max_tokens=15,
    top_p=1.0,
    top_k=-1,
    seed=42,
    stop_token_ids=[tokenizer.eos_token_id],
    min_p=0.0,
    repetition_penalty=1.0,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    logprobs=10,
)

prompt = "How are you?"
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    chat_template=getattr(tokenizer, "chat_template", None),
)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
### As an AI language model, I don't have feelings, but I'm

# And now the same for transformers
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype="float32",
    device_map="auto",
    trust_remote_code=True,
)
prompt = "How are you?"
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    chat_template=getattr(tokenizer, "chat_template", None),
)
inputs = tokenizer(
    [text], padding=True, truncation=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
)
input_lengths = [len(ids) for ids in inputs["input_ids"]]
generated_texts = []
for output, input_len in zip(outputs, input_lengths):
    generated_part = output[input_len:]  # tokens generated after the prompt
    decoded = tokenizer.decode(generated_part, skip_special_tokens=True).strip()
    generated_texts.append(decoded)
print(generated_texts[0])
### "As an AI language model, I don't have feelings or emotions like humans"

As you can see in the code, the answers are "similar" but not identical. I think reproducibility is very important for benchmarking.

Yes, this is a known and expected issue: vLLM and Hugging Face Transformers can produce different outputs even with the same model, precision, and deterministic settings (e.g., temperature=0). The causes are differences in kernel implementations, batching, and numerical precision. Even with all seeds and deterministic flags set, small floating-point differences or batching order can cause divergence, especially across different backends or hardware. This is documented in the vLLM FAQ and multiple GitHub issues: strict output matching is not guaranteed, only best-effort consistency. For benchmarking, vLLM recommends using float32 and matching all sampling parameters, but even then, minor differences may persist due to these factors. See the vLLM FAQ and Model Support Policy for details.
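To make the floating-point point concrete, here is a minimal, framework-free sketch: floating-point addition is not associative, so two kernels that reduce the same values in a different order can legitimately produce different logits.

```python
# Floating-point addition is not associative: the grouping (reduction
# order) changes the rounded result, which is exactly what different
# kernel implementations do when summing the same values.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

print(left == right)  # False
```

The same effect, scaled up across thousands of fused multiply-adds per logit, is why two mathematically equivalent implementations need not agree bit-for-bit.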

If you need bit-for-bit reproducibility, it is only possible under very specific conditions (identical hardware, software versions, and execution order), and even then, vLLM’s optimizations (like dynamic batching) can introduce nondeterminism. For benchmarking, always document your environment and settings, and expect small differences between vLLM and Transformers outputs, even for greedy decoding. Would you like a step-by-step breakdown of the technical reasons for these differences?
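To illustrate why even greedy decoding diverges, here is a toy sketch with invented logit values: when two candidate tokens are nearly tied, rounding noise on the order of a float32 ulp is enough to flip the argmax, and from that token onward the two generations follow different trajectories.

```python
# Hypothetical logits for 3 candidate tokens, from two "runs" that
# differ only by rounding noise in the last decimal place. Greedy
# decoding takes the argmax, so each run picks a different token.
logits_run_a = [10.00001, 10.00000, 3.0]
logits_run_b = [10.00000, 10.00001, 3.0]

pick_a = logits_run_a.index(max(logits_run_a))
pick_b = logits_run_b.index(max(logits_run_b))

print(pick_a, pick_b)  # 0 1
```

Once the picked token differs, it is fed back as context, so the continuations diverge further, which matches the "similar but not identical" outputs above.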
