Most NLP researchers prioritise the development of deep learning models over the quality of training data. This relative lack of attention to data means models end up picking up spurious patterns, social biases, and annotation artefacts from their training sets.
Anna Rogers from the Centre of Social Data Science, University of Copenhagen, recently presented a paper underlining the importance of taking care of data as the first step towards building successful NLP models.
Data curation is the organisation and integration of data collected from multiple sources. The process involves authentication, archiving, management, preservation for retrieval, and representation.
Her paper laid down the arguments for and against data curation.
Why data curation is important
In her paper, Rogers gives the following arguments in support of data curation:
Social biases: Written text may contain all kinds of social biases based on race, gender, social status, age, and ability. Models may learn these biases, and when deployed in real-world scenarios, they may propagate and further amplify them. This puts minority groups at a significant disadvantage. It’s imperative to select data taking sociocultural characteristics into account and promote fair representation of all social groups.
Privacy: Using personally identifiable information in training data can give rise to privacy and security concerns. For example, a study showed GPT-2 memorised personal contact information even when it appeared only on a few web pages. “Deciding what should not be remembered is clearly a data curation issue,” writes Rogers.
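As a rough illustration of what deciding "what should not be remembered" can look like in practice, the sketch below masks email addresses and phone-like numbers before text enters a training corpus. It is not taken from Rogers's paper: the `redact` function, the two regular expressions, and the `[EMAIL]` and `[PHONE]` placeholders are all assumptions made for this example, and simple patterns like these will miss many kinds of personal data (names, addresses, ID numbers).

```python
import re

# Illustrative patterns only (an assumption for this sketch): real PII
# detection needs far broader coverage than two regular expressions.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact("Contact jane@example.com or +45 12 34 56 78."))
```

Masking rather than deleting whole documents keeps the surrounding text usable for training while removing the specific strings a model could otherwise memorise.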
Security: Universal adversarial triggers are input sequences that force a model to output a particular prediction. The phenomenon was discovered only recently, and it can be exploited through the training data, compromising even robust models. Data curation can help defend against such attacks.
Evaluation methodology: In most NLP tasks, test samples are drawn from the same distribution as the training samples, so the two sets can overlap. Curation is necessary to ensure that no test example has already been seen during training.
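One simple way to check for this kind of overlap is to compare normalised copies of the training and test examples. The sketch below is only an illustration under stated assumptions: the `normalise`, `find_overlap`, and `drop_overlap` helpers are hypothetical names, and exact matching after lowercasing and whitespace collapsing will not catch paraphrases or near-duplicates.

```python
import re
from typing import List


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_overlap(train: List[str], test: List[str]) -> List[str]:
    """Return the test examples whose normalised form also occurs in train."""
    seen = {normalise(example) for example in train}
    return [example for example in test if normalise(example) in seen]


def drop_overlap(train: List[str], test: List[str]) -> List[str]:
    """Return the test set with every overlapping example removed."""
    seen = {normalise(example) for example in train}
    return [example for example in test if normalise(example) not in seen]


if __name__ == "__main__":
    train = ["The cat sat.", "Dogs bark"]
    test = ["the  cat sat.", "Birds sing"]
    print(find_overlap(train, test))
    print(drop_overlap(train, test))
```

A stricter check could compare n-gram overlap or embeddings instead of exact strings, at the cost of more computation and a threshold that has to be chosen by hand.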
Progress towards NLU: With rapid scaling, we often lose track of the data on which a model is trained. Without data curation, models may suffer from one or more of the following issues:
- Falling prey to common perturbations and stumbling over basic linguistic phenomena such as negation.
- Learning spurious patterns in the data.
- Struggling to learn rare occurrences.
Arguments against data curation
Many experts believe data must be used in its natural form to give an unvarnished output. While there is no problem with this argument, Rogers said, it needs more elaboration. “In that case, the ‘natural’ distribution may not even be what we want: e.g. if the goal is a question answering system, then the ‘natural’ distribution of questions asked in daily life (with most questions about time and weather) will not be helpful,” wrote Rogers. She added that a lot of research is still needed before developers can study the world as it is.
Some developers feel their data is large enough for their training set to encompass the ‘entire data universe’. Rogers said collecting all data is impossible, as it would pose legal, ethical, and practical challenges.
Meanwhile, many are in favour of developing algorithmic alternatives to data curation. As per Rogers, this is a good possibility; for now, however, such solutions would complement data curation rather than replace it completely.
A few experts believe data curation is only one part of the process and should not grow into a task so large that it eclipses the original purpose of developing a model. Even though current deep learning systems have improved, they still generalise only within the range of their training data, Rogers said.
“A perfect dataset would provide a strong signal for each phenomenon that should be learned. That’s not how language works, so we may never be able to create something like that,” she said. While it may be difficult to achieve perfect solutions, it is always possible to improve the models.
Curation means making a decision about what to include and what to exclude. This can be a daunting task and requires a lot of interdisciplinary expertise, Rogers said.
“We do want more robust and linguistically capable models, and we do want models that do not leak sensitive data or propagate harmful stereotypes. Whether those goals would be ultimately achieved by curating large corpora or by more algorithmic solutions, in both cases we need to do a lot more data work,” writes Rogers. To achieve this goal, developers have to overcome interdisciplinary tensions and promote truly collaborative spaces.