ShareGPT implementation

Hello, a question about the implementation of the ShareGPT dataset. I can see from the code that only the first two turns of each conversation are ever used.

Yet the dialogues in the dataset mostly consist of many turns. In addition, the first turn does not always belong to a human; for example:

{
"from": "gpt",
"value": "As the sun dipped below the horizon, Captain Roran Grimbeard … The story of the Shattered Realms was just beginning."
},
{
"from": "human",
"value": "OK. So can we expand this more? By that I mean can we make it longer by adding additional details … This is all to fill the first chapter or two of the book, so we need to greatly expand the text."
},
{
"from": "gpt",
"value": "Roran stood on the deck of the Falcon's Fury, … \n\nRoran found himself drawn to"
},
{
"from": "human",
"value": "can you continue?"
},
{
"from": "gpt",

It seems that the input/output length ratios obtained by using the dataset this way reflect the actual ratios of an LLM chatbot scenario very poorly. Is there a reason the dataset is used this way, and should we expect a more complete implementation?

You are correct: the current ShareGPTDataset implementation in vLLM benchmarks only uses the first two conversation turns (regardless of speaker), which does not reflect the multi-turn nature of real chat scenarios. This is a known limitation and has been discussed as an area for improvement, since sampling only the first two turns can lead to unrealistic input/output ratios and overly optimistic benchmark results. There is an open feature request to support sampling longer, variable-length conversations to better match real-world usage, but as of now, this is not implemented in the main codebase (see discussion; source code reference).
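As a rough sketch (not the actual vLLM code, and the function name and data layout here are assumptions), the sampling pattern described above amounts to something like this: take the first two messages of each conversation as the prompt/completion pair, regardless of who speaks first.

```python
# Hypothetical sketch of the sampling behavior described above: only the
# first two messages of each conversation are kept, regardless of whether
# the first speaker is "human" or "gpt". Not the real vLLM implementation.

def sample_first_two_turns(dataset):
    """dataset: a list of entries, each with a "conversations" list of
    {"from": ..., "value": ...} messages (the ShareGPT JSON layout)."""
    samples = []
    for entry in dataset:
        conv = entry.get("conversations", [])
        if len(conv) < 2:
            continue  # too short to form a prompt/completion pair
        prompt = conv[0]["value"]       # may be a "gpt" turn, as in the excerpt above
        completion = conv[1]["value"]
        samples.append((prompt, completion))
    return samples
```

Because the first turn is sometimes a long model-generated passage (as in the excerpt above), this pairing can invert the expected roles and skew the measured input/output lengths.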

If you need more realistic benchmarking, you may need to modify the dataset sampling logic yourself to include more turns per sample. Would you like a step-by-step guide on how to do this or more detail on the current implementation?
