Finding a base dataset

To teach new downstream tasks to deep neural network models like T5, we need to find datasets that match the input and output format the model works with. T5 is a text-to-text model, which means that it is trained on pairs of input text and output text and learns to map one to the other. A sample data point from Google’s PAWS dataset looks as follows, where the label indicates whether Sentence 1 and Sentence 2 are paraphrases of each other:

[Screenshot: sample data point from the PAWS dataset]
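
To make the text-to-text idea concrete, here is a minimal sketch of how labeled sentence pairs could be turned into input/target strings for a paraphrasing task. The field names (`sentence1`, `sentence2`, `label`), the `paraphrase: ` prefix, the helper `to_text_to_text`, and the sample sentences are assumptions made up for illustration; they are not taken from the screenshot above.

```python
from typing import Dict, List

# Hypothetical records in the shape of a labeled sentence pair.
# Field names and sentences are invented for illustration only.
records = [
    {"sentence1": "The cat sat on the mat.",
     "sentence2": "On the mat sat the cat.",
     "label": 1},
    {"sentence1": "Anna called Ben.",
     "sentence2": "Ben called Anna.",
     "label": 0},
]


def to_text_to_text(rows: List[Dict], prefix: str = "paraphrase: ") -> List[Dict[str, str]]:
    """Keep only pairs labeled as paraphrases (label == 1) and
    turn each one into an input string and a target string."""
    return [
        {"input_text": prefix + row["sentence1"],
         "target_text": row["sentence2"]}
        for row in rows
        if row["label"] == 1
    ]


pairs = to_text_to_text(records)
for pair in pairs:
    print(pair["input_text"], "->", pair["target_text"])
```

Dropping the pairs with label 0 is one possible design choice here: for a generation task, only true paraphrases are useful as targets, while the negative pairs would matter if the model were trained to classify instead.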

Some additional datasets that are suitable for teaching new downstream tasks or functionality to T5 are Google’s DiscoFuse and Google’s WikiSplit. DiscoFuse is a dataset for fusing two sentences into one and consists of more than 60 million pairs. WikiSplit focuses on splitting sentences and thus inverts the idea of DiscoFuse. This presentation at datalift #3 exemplifies how the above datasets can be used for fine-tuning Google’s T5 on PAWS (including the corresponding code).
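
Since fusing and splitting are inverse tasks, one record can in principle serve both directions. The following sketch assumes a hypothetical record layout (`sentence_a`, `sentence_b`, `fused`) and hypothetical task prefixes (`fuse: `, `split: `); neither is the actual schema of DiscoFuse or WikiSplit.

```python
from typing import Dict

# Hypothetical record: two short sentences and their fused form.
# The keys and the text are invented for illustration only.
record = {
    "sentence_a": "It was raining.",
    "sentence_b": "We stayed inside.",
    "fused": "It was raining, so we stayed inside.",
}


def fuse_example(row: Dict[str, str]) -> Dict[str, str]:
    """Two sentences as input, the fused sentence as target."""
    return {
        "input_text": "fuse: " + row["sentence_a"] + " " + row["sentence_b"],
        "target_text": row["fused"],
    }


def split_example(row: Dict[str, str]) -> Dict[str, str]:
    """The fused sentence as input, the two sentences as target."""
    return {
        "input_text": "split: " + row["fused"],
        "target_text": row["sentence_a"] + " " + row["sentence_b"],
    }


fuse = fuse_example(record)
split = split_example(record)
```

The task prefix tells a text-to-text model which of the two directions is requested, so both kinds of examples could be mixed in one training set.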

Oftentimes, datasets are already available, and as you will learn in the course of this manuscript, compiling a custom dataset is tedious at first. Hence, here are a few best practices for checking whether a dataset for your task already exists. First, Google for related problems. I often find results from Medium.com particularly helpful, since they usually provide comprehensible instructions and the corresponding code. A more academic approach is to scan Google Scholar for related scientific articles. Here, you have to digest the academic narrative, and it frankly may involve a solid amount of reengineering (and reverting to old Python dependencies) to be able to run the code, since authors also tend not to export the versions of their dependencies. Furthermore, searching GitHub repositories allows you to combine the works of different companies such as Facebook AI Research or other outfits. Lastly, Kaggle.com holds many NLP datasets, too. Google Dataset Search has not been too useful for me in the past. Once you have checked all these sources without success, you may want to consider compiling your own dataset. This also involves some considerations.

💁🏼‍♂️ Considerations in the Absence of Appropriate Datasets

If searching the sources presented above did not yield any results, the need may arise to compile your own dataset for the particular task to be solved. In this case, consider the following aspects of any raw dataset that could serve as the base dataset from which we will compile the actual dataset. These are:

In the course of this case study, we will work with mail data in order to change the character of a text from introvert to extrovert and the reverse. Therefore, we will need a dataset composed of mails as a base.