Hello, I have a question about how the ShareGPT dataset is used. From the code I can see that only the first two turns of each dialog are ever used.
However, the dialogs in the dataset mostly consist of many turns. In addition, the first turn does not always belong to a human; for example:
{
"from": "gpt",
"value": "As the sun dipped below the horizon, Captain Roran Grimbeard … The story of the Shattered Realms was just beginning."
},
{
"from": "human",
"value": "OK. So can we expand this more? By that I mean can we make it longer by adding additional details … This is all to fill the first chapter or two of the book, so we need to greatly expand the text."
},
{
"from": "gpt",
"value": "Roran stood on the deck of the Falcon's Fury, … \n\nRoran found himself drawn to"
},
{
"from": "human",
"value": "can you continue?"
},
{
"from": "gpt",
…
It seems that the input/output length ratios obtained by using the dataset this way reflect the actual ratios in an LLM chatbot scenario very poorly. Is there a reason the dataset is used this way, and should we expect a more complete implementation of its use?
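To make the concern concrete, here is a minimal sketch (not the actual benchmark code; the helper names and the sample conversation are hypothetical) showing how taking only the first two turns of a conversation whose opening turn is from "gpt" produces a human/gpt length ratio quite different from the ratio over the full dialog:

```python
# Hypothetical sketch of the truncation issue; the sample data mimics a
# ShareGPT conversation that starts with a "gpt" turn, as in the example above.

def starts_with_human(conv):
    """True if the first turn of the conversation is from a human."""
    return bool(conv) and conv[0]["from"] == "human"

def length_ratio(turns):
    """Ratio of total human 'value' length to total gpt 'value' length."""
    human = sum(len(t["value"]) for t in turns if t["from"] == "human")
    gpt = sum(len(t["value"]) for t in turns if t["from"] == "gpt")
    return human / gpt if gpt else float("inf")

# Made-up multi-turn dialog: long gpt turns, short human follow-ups.
sample = [
    {"from": "gpt", "value": "A long opening story " * 20},
    {"from": "human", "value": "OK. So can we expand this more?"},
    {"from": "gpt", "value": "Roran stood on the deck " * 30},
    {"from": "human", "value": "can you continue?"},
    {"from": "gpt", "value": "He pressed on " * 25},
]

print(starts_with_human(sample))                 # first turn is from "gpt"
print(length_ratio(sample[:2]))                  # ratio using only 2 turns
print(length_ratio(sample))                      # ratio over the full dialog
```

Truncating to two turns here more than doubles the apparent human/gpt ratio, and the "prompt" turn is not even a human message, which is exactly the mismatch with a real chatbot workload described above.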