Google AI has released GoEmotions, a human-annotated dataset of 58,000 Reddit comments drawn from popular English-language subreddits and labelled with 27 emotion categories (12 positive, 11 negative, and 4 ambiguous) plus 1 “neutral” category. The company designed the GoEmotions taxonomy with both psychology and data applicability in mind.
Earlier datasets for emotion analysis were small and typically covered only six basic emotions: anger, surprise, disgust, joy, fear, and sadness.
The GoEmotions taxonomy aims to:
- provide the greatest coverage of the emotions expressed in Reddit data
- provide the best coverage of types of emotional expressions
- limit the overall number of emotions and their overlap
Reddit was chosen as the source for this dataset because it offers a large volume of publicly available content with direct user-to-user conversation.
What is GoEmotions exactly?
Building the dataset involved the following steps:
- Selecting and curating Reddit comments – The dataset uses a Reddit data dump originating in the Reddit-data-tools project. Google used Reddit comments from 2005 to January 2019, sourced from subreddits with at least 10,000 comments, and excluded deleted and non-English comments.
Reddit comes with its own problems, however. It has a known demographic bias towards young male users, which does not reflect a globally diverse population, and it skews towards toxic, offensive language. To address this, the researchers identified harmful comments using predefined lists of terms for offensive, adult, and vulgar content and for identity and religion. These lists were then used for data filtering and masking.
To reduce profanity, the researchers removed subreddits that were not safe for work and those where at least 10% of comments contained offensive or vulgar tokens. They preserved vulgar comments, since these help a model learn about negative emotions. They reviewed comments containing identity terms and removed those that were offensive towards a particular ethnicity, gender, sexual orientation, or disability.
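The subreddit-level filter described above can be sketched as follows. This is a minimal illustration, not the researchers' code: the term list is a placeholder, the helper names `flagged` and `keep_subreddits` are hypothetical, and treating 10% as an inclusive cut-off is an assumption.

```python
from collections import defaultdict

# Placeholder terms for illustration only; the real term lists are not reproduced here.
OFFENSIVE_OR_VULGAR = {"badword", "slur"}


def flagged(comment, terms=OFFENSIVE_OR_VULGAR):
    """Return True if any lowercased, punctuation-stripped token is in the term list."""
    return any(tok.strip(".,!?") in terms for tok in comment.lower().split())


def keep_subreddits(comments, max_share=0.10):
    """Return the subreddits whose share of flagged comments is below max_share.

    `comments` is an iterable of (subreddit, text) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for subreddit, text in comments:
        total[subreddit] += 1
        hits[subreddit] += flagged(text)  # bool counts as 0 or 1
    return {s for s in total if hits[s] / total[s] < max_share}
```

A subreddit with one flagged comment out of ten would be dropped under this reading, while one with one flagged comment out of twenty-one would be kept.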
- Filtering by length – The researchers applied the Natural Language Toolkit's (NLTK) word tokenizer and selected comments 3-30 tokens long, including punctuation. They then downsampled, capping at the number of comments with the median token count. Next, they ran an emotion prediction model trained on a pilot batch of 2,200 annotated examples and excluded subreddits consisting of more than 30% neutral comments or less than 20% negative, positive, or ambiguous comments.
They assigned an emotion to each comment using the pilot model described above. They then reduced emotion bias by downsampling the weakly labelled data, capping at the number of comments belonging to the median emotion count. To avoid over-representing popular subreddits, they downsampled again, capping at the median subreddit count. From the remaining 315,000 comments, drawn from 482 subreddits, they randomly sampled comments for annotation.
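A rough sketch of the length filter and the median-capped downsampling is shown here. Several things are assumptions: the regex tokenizer is only a stand-in for NLTK's word tokenizer, `cap_at_median` is a hypothetical helper, and "capped by the median" is read as shrinking every group larger than the median group size down to that size.

```python
import random
import re
import statistics
from collections import defaultdict


def tokenize(text):
    """Stand-in tokenizer: word runs and single punctuation marks (not NLTK)."""
    return re.findall(r"\w+|[^\w\s]", text)


def length_filter(comments, lo=3, hi=30):
    """Keep comments whose token count, punctuation included, is within [lo, hi]."""
    return [c for c in comments if lo <= len(tokenize(c)) <= hi]


def cap_at_median(items, key, seed=0):
    """Group items by key(item) and downsample oversized groups to the median group size."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    cap = int(statistics.median(len(g) for g in groups.values()))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for group in groups.values():
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept
```

The same `cap_at_median` helper could be applied with different keys (token count, predicted emotion, subreddit) to mirror the three balancing passes described above.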
- Masking – They masked proper names referring to people with a [NAME] token, using a named entity tagger based on BERT (Bidirectional Encoder Representations from Transformers), and masked religion terms with a [RELIGION] token. The raters viewed unmasked comments during rating.
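The masking step might look roughly like the sketch below. It does not run a BERT tagger: the person-name character spans are assumed to come from some external named entity tagger, the religion term list is illustrative only, and `mask_comment` is a hypothetical helper name.

```python
import re

# Illustrative terms only; the original religion term list is not reproduced here.
RELIGION_TERMS = ["Buddhist", "Christian", "Hindu", "Jewish", "Muslim"]
_RELIGION_RE = re.compile(r"\b(" + "|".join(RELIGION_TERMS) + r")s?\b", re.IGNORECASE)


def mask_comment(text, person_spans):
    """Replace person-name spans with [NAME] and religion terms with [RELIGION].

    `person_spans` holds (start, end) character offsets, assumed to be produced
    by an external named entity tagger.
    """
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(person_spans, reverse=True):
        text = text[:start] + "[NAME]" + text[end:]
    return _RELIGION_RE.sub("[RELIGION]", text)
```

For example, calling `mask_comment("Thanks Maria, my Hindu neighbour agrees.", [(7, 12)])` replaces the name span and the religion term while leaving the rest of the comment untouched.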
The researchers applied Principal Preserved Component Analysis (PPCA) to the data. PPCA examines the cross-covariance between datasets rather than the variance-covariance matrix within a single dataset.
Each component is significant, with p-values < 1.5e-6 for all dimensions, which indicates that each emotion captures a unique part of the data. Although the taxonomy has no predefined notion of sentiment, emotions with related sentiment cluster together. Similarly, emotions that are related in intensity, such as sadness and grief or annoyance and anger, are closely correlated.
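To make the cross-covariance idea concrete, here is a simplified sketch using NumPy. It is one reading of PPCA rather than the researchers' implementation: it assumes two rating matrices for the same comments from disjoint rater groups, symmetrises their cross-covariance, and takes its eigenvectors as components. The function name `ppca` is hypothetical, and the significance testing behind the reported p-values is omitted.

```python
import numpy as np


def ppca(x, y):
    """Return eigenvalues and components of the symmetrised cross-covariance.

    x, y: arrays of shape (n_examples, n_emotions) holding ratings of the same
    examples from two disjoint groups of raters (an assumption of this sketch).
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    cross = x.T @ y / (n - 1)        # cross-covariance between the two rating sets
    sym = (cross + cross.T) / 2      # symmetrise so the eigenvectors are real and orthogonal
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]  # largest preserved covariance first
    return eigvals[order], eigvecs[:, order]
```

Because rater noise is independent across the two groups, it tends to cancel in the cross-covariance, whereas it would inflate the ordinary within-dataset covariance that standard PCA uses.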