vllm bench serve: order of "generated_texts"

How can I make sure the order of “generated_texts” is the same as the order of requests / prompts?

vllm bench serve \
	--dataset-name custom \
	--dataset-path dataset.jsonl \
	--num-prompts 1000 \
	--max-concurrency 32 \
	--model Qwen/Qwen2.5-7B-Instruct \
	--save-result --save-detailed \
	--result-filename vllm_result.json

The order of generated texts in the vllm bench serve output is not guaranteed to match the order of input prompts when using high concurrency, because requests may complete out of order. To ensure alignment, post-process the results by matching each output to its corresponding input prompt using unique identifiers or indices included in your dataset or results, as recommended in community discussions. There is no built-in vLLM flag or parameter to enforce that the output order matches the input order in the benchmarking tool itself, so manual alignment is required after the benchmark completes. This limitation is not explicitly documented in the official docs, but it is discussed in related GitHub issues and PRs on benchmarking and result handling.

Would you like more detail on how to implement this post-processing step?

Hint: Mention @RunLLM in the post for followups.

@RunLLM I do not see request IDs. How can I implement this post-processing step?

vLLM’s benchmarking output does not include request IDs by default. To align outputs with inputs, add a unique field (such as “custom_id” or an index) to each entry in your input dataset (dataset.jsonl). After benchmarking, match each output to its input using this field. This approach is shown in the OpenAI batch example, where “custom_id” is used to correlate requests and responses, enabling reliable post-processing alignment even when outputs are unordered or batched asynchronously. There is no built-in vLLM flag to automatically add request IDs to outputs in the benchmarking tool itself, so this manual step is necessary.

Would you like a code example for this matching process?

Please show a code example. @RunLLM

Here’s a code example for aligning outputs with inputs using a unique “custom_id” field in your dataset. First, add a “custom_id” to each prompt in your dataset.jsonl. After running vllm bench serve, load both the input and output files, and match outputs to inputs by “custom_id”:

import json

# Load input prompts with custom_id
with open("dataset.jsonl", "r") as f:
    input_data = [json.loads(line) for line in f]

# Load vllm outputs (assuming each output has a 'custom_id' field)
with open("vllm_result.json", "r") as f:
    output_data = json.load(f)

# Build a mapping from custom_id to output
output_map = {item["custom_id"]: item for item in output_data}

# Align outputs to input order
aligned_outputs = [output_map[item["custom_id"]] for item in input_data]

# Now aligned_outputs is in the same order as input_data

This approach assumes you add a “custom_id” to each input and that the vLLM output includes this field. If not, you must modify your dataset and possibly the vLLM code to propagate “custom_id” through the pipeline.

Yes, but how can you make sure the result has these custom IDs in the first place? @RunLLM

vLLM’s vllm bench serve does not automatically propagate custom IDs from your input dataset to the output results. To ensure each result has a custom ID, you must add a unique field (e.g., “custom_id”) to each entry in your dataset.jsonl and then modify the vLLM benchmarking or serving code to include this field in the output. This is demonstrated in the OpenAI batch example, where each request in the batch file includes a “custom_id” and the output preserves it. However, the standard vllm bench serve workflow does not natively support this, so you would need to adapt the benchmarking script or use the OpenAI batch runner, which supports “custom_id” natively in both input and output.
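For reference, the OpenAI batch input format mentioned above looks roughly like the following sketch: one JSON object per line, each carrying a unique "custom_id" that the batch runner echoes back in the matching result line. The model name and prompts here are placeholders; check the vLLM batch example for the exact fields your version expects.

```python
import json

# Build batch input lines in the OpenAI batch format: each line is a
# standalone request object whose "custom_id" survives into the output.
prompts = ["What is 2 + 2?", "Name a prime number."]

batch_lines = []
for i, prompt in enumerate(prompts):
    batch_lines.append({
        "custom_id": f"request-{i}",   # unique per request, echoed in results
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
    })

# Write one JSON object per line, as the batch runner expects
with open("batch_input.jsonl", "w") as f:
    for line in batch_lines:
        f.write(json.dumps(line) + "\n")
```

Because every result line carries the same "custom_id" as its request, reordering the outputs afterwards is a simple dictionary lookup rather than a positional guess.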

Would you like more detail on modifying the benchmarking workflow or using the OpenAI batch runner?

Maybe we can use structured output. @RunLLM

Structured output in vLLM is designed to constrain the format of model responses (e.g., JSON, regex, choice, grammar), but it does not automatically propagate custom metadata like “custom_id” from input to output. Structured outputs ensure the generated text matches a schema, but do not attach or return request-level metadata unless you explicitly include it in the prompt and schema, and the model is instructed to echo it back. This is not a built-in feature for tracking request IDs through the vLLM benchmarking pipeline.
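If you did want to try the echo-it-back approach, a sketch could look like the following. This is not a built-in vLLM feature: the id is simply placed in the prompt, and a JSON schema forces the model to return a "custom_id" field. The "guided_json" extra_body parameter is an assumption about how your vLLM version's OpenAI-compatible server accepts schemas; note also that this relies on the model faithfully echoing the id, which is weaker than true request tracking.

```python
# Schema that requires the model to return the id alongside its answer
schema = {
    "type": "object",
    "properties": {
        "custom_id": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["custom_id", "answer"],
}

def build_request(custom_id: str, question: str) -> dict:
    """Build a chat request payload that asks the model to echo custom_id.

    The payload would be sent via an OpenAI-compatible client; here we
    only construct it, so nothing below talks to a server.
    """
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": (
                f"custom_id: {custom_id}\n"
                "Answer the question and return JSON containing the same "
                f"custom_id.\nQuestion: {question}"
            ),
        }],
        # Assumed parameter name; structured-output options vary by version
        "extra_body": {"guided_json": schema},
    }

payload = build_request("req-42", "What is the capital of France?")
```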

Would you like more detail on how to use structured outputs to include IDs in the generated text?

In vllm bench serve documentation, there is this option --request-id-prefix. Is this option actually used anywhere? If I can find how to retrieve the request ID, I can probably save it into the result, and reorder the responses accordingly. @RunLLM

The --request-id-prefix option is present in vLLM’s OpenAIServing class and related benchmarking scripts, where it prepends a string to each request’s ID for easier identification. However, in the context of vllm bench serve, there is no explicit documentation or guarantee that the generated request IDs (with the prefix) are included in the saved result files by default. The result files typically do not expose request IDs unless the benchmarking or serving code is modified to include them in the output. Thus, while --request-id-prefix is used internally for request tracking and logging, it is not directly surfaced in the benchmarking results for post-processing or reordering without further customization.

Would you like more detail on how to modify the code to include request IDs in your results?

@RunLLM Yes, I think I need to edit this part.

    if isinstance(metrics, BenchmarkMetrics):
        result = {
            "request_ids": [output.request_id for output in outputs],
            "duration": benchmark_duration,
            "completed": metrics.completed,
            "total_input_tokens": metrics.total_input,
            "total_output_tokens": metrics.total_output,
            "request_throughput": metrics.request_throughput,
            "request_goodput": metrics.request_goodput if goodput_config_dict else None,
            "output_throughput": metrics.output_throughput,
            "total_token_throughput": metrics.total_token_throughput,
            "input_lens": [output.prompt_len for output in outputs],
            "output_lens": actual_output_lens,
            "ttfts": [output.ttft for output in outputs],
            "itls": [output.itl for output in outputs],
            "generated_texts": [output.generated_text for output in outputs],
            "errors": [output.error for output in outputs],
            "max_output_tokens_per_s": metrics.max_output_tokens_per_s,
            "max_concurrent_requests": metrics.max_concurrent_requests,
        }
    else:
        result = {
            "request_ids": [output.request_id for output in outputs],
            "duration": benchmark_duration,
            "completed": metrics.completed,
            "total_input_tokens": metrics.total_input,
            "request_throughput": metrics.request_throughput,
            "total_token_throughput": metrics.total_token_throughput,
            "input_lens": [output.prompt_len for output in outputs],
            "errors": [output.error for output in outputs],
        }

But then it got an error:

AttributeError: 'RequestFuncOutput' object has no attribute 'request_id'

The error occurs because the RequestFuncOutput object does not have a request_id attribute by default in vLLM’s benchmarking code. The standard RequestFuncOutput class (used in endpoint benchmarking) does not include request_id—it only tracks fields like generated_text, success, latency, etc. This is confirmed in the RequestFuncOutput definition.

To fix this, you need to:

  1. Modify the request function (e.g., async_request_openai_completions) to pass the request ID from the input (RequestFuncInput.request_id) to the output.
  2. Add a request_id field to the RequestFuncOutput dataclass and set it when constructing the output.
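The two steps above can be sketched as follows. This mirrors the shape of vLLM's RequestFuncInput / RequestFuncOutput dataclasses with a request_id field added; field names other than request_id and generated_text are illustrative, and fake_request stands in for the real async request function.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestFuncInput:
    prompt: str
    request_id: Optional[str] = None   # new: carried in from the dataset

@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    request_id: Optional[str] = None   # new: propagated from the input

def fake_request(request_func_input: RequestFuncInput) -> RequestFuncOutput:
    """Stand-in for async_request_openai_completions showing the id copy."""
    output = RequestFuncOutput()
    output.request_id = request_func_input.request_id   # the key line
    output.generated_text = f"echo: {request_func_input.prompt}"
    output.success = True
    return output
```

Once the real request function copies the id this way, the `"request_ids"` entry in the result dict above no longer raises AttributeError.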

Would you like a code snippet showing how to add request_id to both the input and output classes?

I think I have found the issue and potential solutions. In the benchmark datasets code, these lines shuffle the prompts before sending as requests to the server:

random.seed(self.random_seed)
random.shuffle(self.data)

As I wanted to use the generated_texts for further evaluation comparing them to the original data, we can apply one of these three potential solutions:

  1. Comment out these two lines;
  2. Apply the same seed on the data before evaluation to obtain the same order;
  3. Modify the load_data() function in the evaluation script. As I use the CustomDataset (i.e. --dataset-name custom) for loading a JSONL file, this can look as follows:
import pandas as pd

from vllm.benchmarks.datasets import CustomDataset

def patched_load_data(self):
    """Patched version of load_data that doesn't shuffle for evaluation."""
    if self.dataset_path is None:
        raise ValueError("dataset_path must be provided for loading data.")

    self.data = []

    if self.dataset_path.endswith(".jsonl"):
        jsonl_data = pd.read_json(path_or_buf=self.dataset_path,
                                  lines=True)

        if "prompt" not in jsonl_data.columns:
            raise ValueError("JSONL file must contain a 'prompt' column.")

        for _, row in jsonl_data.iterrows():
            self.data.append(row.to_dict())
    else:
        raise NotImplementedError("Only JSONL format is supported for CustomDataset.")
    
    # Remove shuffling for evaluation purposes
    # random.seed(self.random_seed)
    # random.shuffle(self.data)

# Apply the modification
CustomDataset.load_data = patched_load_data

Using option 2 or 3 is cleaner than option 1, as neither requires editing the vLLM source and so neither is affected by vLLM updates.
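Option 2 can be sketched like this: re-apply the same seed and shuffle to a copy of the original rows so that row i lines up with generated_texts[i] in the result file. The seed value must match the one the benchmark actually used (0 here is an assumption; check the --seed your run was given).

```python
import random

def reproduce_benchmark_order(items, seed):
    """Shuffle a copy of items with the seed the benchmark used, so that
    element i matches generated_texts[i] in vllm_result.json."""
    ordered = list(items)          # copy, leave the original order intact
    random.seed(seed)              # same as random.seed(self.random_seed)
    random.shuffle(ordered)        # same as random.shuffle(self.data)
    return ordered

# Inline stand-in for dataset.jsonl rows; in practice load the file with
# json.loads per line and pass the benchmark's actual seed.
rows = [{"prompt": f"question {i}"} for i in range(5)]
shuffled = reproduce_benchmark_order(rows, seed=0)
```

Because random.shuffle is deterministic for a given seed, running this again with the same inputs reproduces the exact request order, which is what makes option 2 viable without touching vLLM.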

@RunLLM

Your analysis is correct: the shuffling in CustomDataset.load_data() causes the order of prompts (and thus outputs) to differ from the original dataset. For evaluation where order matters, patching load_data() to skip shuffling (your option 3) is the cleanest and most robust solution, as it avoids modifying vLLM source code and is resilient to upstream changes. Alternatively, re-applying the same seed and shuffle in your evaluation script (option 2) also works, but is more error-prone if the shuffling logic changes.

Would you like a full example of how to patch and use the unshuffled CustomDataset for evaluation?

I have submitted a PR to address this by adding a new --disable-shuffle argument, which is now merged. :sparkles: As indicated by the latest vLLM docs, when running vllm bench serve, the new --disable-shuffle option can be added to disable shuffling of dataset samples for deterministic ordering.

As explained in the previous reply, this change simply makes the shuffling behaviour introduced by random.shuffle(self.data) conditional, disabling it when the option is passed.
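With the merged option, the original command from the top of the thread only needs one extra flag for generated_texts to follow the dataset order:

```shell
vllm bench serve \
	--dataset-name custom \
	--dataset-path dataset.jsonl \
	--num-prompts 1000 \
	--max-concurrency 32 \
	--model Qwen/Qwen2.5-7B-Instruct \
	--save-result --save-detailed \
	--result-filename vllm_result.json \
	--disable-shuffle
```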