I think I have found the issue and some potential solutions. In the benchmark datasets code, these lines shuffle the prompts before they are sent as requests to the server:
```python
random.seed(self.random_seed)
random.shuffle(self.data)
```
Since I want to use the `generated_texts` for further evaluation, comparing them against the original data, we can apply one of these three potential solutions:

- Comment out these two lines;
- Apply the same seed to the data before evaluation to reproduce the same order;
- Modify the `load_data()` function in the evaluation script. As I use the `CustomDataset` (i.e. `--dataset-name custom`) for loading a JSONL file, this can look as follows:
```python
import pandas as pd

from vllm.benchmarks.datasets import CustomDataset


def patched_load_data(self):
    """Patched version of load_data that doesn't shuffle, for evaluation."""
    if self.dataset_path is None:
        raise ValueError("dataset_path must be provided for loading data.")

    self.data = []
    if self.dataset_path.endswith(".jsonl"):
        jsonl_data = pd.read_json(path_or_buf=self.dataset_path, lines=True)
        if "prompt" not in jsonl_data.columns:
            raise ValueError("JSONL file must contain a 'prompt' column.")
        for _, row in jsonl_data.iterrows():
            self.data.append(row.to_dict())
    else:
        raise NotImplementedError(
            "Only JSONL format is supported for CustomDataset.")

    # Shuffling removed for evaluation purposes:
    # random.seed(self.random_seed)
    # random.shuffle(self.data)


# Apply the monkey patch before the dataset is loaded
CustomDataset.load_data = patched_load_data
```
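For option 2, it is enough to replay the benchmark's shuffle on the original data before evaluation, since `random.shuffle` with a fixed seed is deterministic. A minimal sketch (the seed value `0` and the toy prompts are placeholders; use the seed you passed to the benchmark run):

```python
import random

# Toy stand-in for the original dataset (placeholder prompts).
original = [{"prompt": f"prompt {i}"} for i in range(5)]

# Replay the shuffle the benchmark applied; the same seed yields the same order.
replayed = list(original)
random.seed(0)  # placeholder: use the benchmark's random_seed
random.shuffle(replayed)

# replayed[i] now lines up with the i-th generated text from the benchmark run.
```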
Options 2 and 3 are cleaner than option 1, since they do not require editing the installed vLLM source and are therefore not undone by vLLM updates.