ShareGPT implementation

Hello, a question about the implementation of the ShareGPT dataset. I can see from the code that only the first two turns of each conversation are ever used.

Yet the dialogues in the dataset mostly consist of many turns. In addition, the first turn does not always belong to a human; for example:

{
"from": "gpt",
"value": "As the sun dipped below the horizon, Captain Roran Grimbeard … The story of the Shattered Realms was just beginning."
},
{
"from": "human",
"value": "OK. So can we expand this more? By that I mean can we make it longer by adding additional details … This is all to fill the first chapter or two of the book, so we need to greatly expand the text."
},
{
"from": "gpt",
"value": "Roran stood on the deck of the Falcon's Fury, … \n\nRoran found himself drawn to"
},
{
"from": "human",
"value": "can you continue?"
},
{
"from": "gpt",

It seems that the input/output length ratios obtained by using the dataset this way reflect the actual ratios of an LLM chatbot scenario very poorly. Is there a reason the dataset is used this way, and should we expect a more complete implementation?

You are correct: the current ShareGPTDataset implementation in vLLM benchmarks only uses the first two conversation turns (regardless of speaker), which does not reflect the multi-turn nature of real chat scenarios. This is a known limitation and has been discussed as an area for improvement, since sampling only the first two turns can lead to unrealistic input/output ratios and overly optimistic benchmark results. There is an open feature request to support sampling longer, variable-length conversations to better match real-world usage, but as of now, this is not implemented in the main codebase (see discussion; source code reference).
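As a rough sketch (not the actual vLLM code, and the function name and data layout here are assumptions), the sampling pattern described above amounts to something like this: take the first two messages of each conversation as the prompt/completion pair, regardless of who speaks first.

```python
# Hypothetical sketch of the sampling behavior described above: only the
# first two messages of each conversation are kept, regardless of whether
# the first speaker is "human" or "gpt". Not the real vLLM implementation.

def sample_first_two_turns(dataset):
    """dataset: a list of entries, each with a "conversations" list of
    {"from": ..., "value": ...} messages (the ShareGPT JSON layout)."""
    samples = []
    for entry in dataset:
        conv = entry.get("conversations", [])
        if len(conv) < 2:
            continue  # too short to form a prompt/completion pair
        prompt = conv[0]["value"]       # may be a "gpt" turn, as in the excerpt above
        completion = conv[1]["value"]
        samples.append((prompt, completion))
    return samples
```

Because the first turn is sometimes a long model-generated passage (as in the excerpt above), this pairing can invert the expected roles and skew the measured input/output lengths.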

If you need more realistic benchmarking, you may need to modify the dataset sampling logic yourself to include more turns per sample. Would you like a step-by-step guide on how to do this or more detail on the current implementation?
