Google AI Releases Fine-Grained Emotion Dataset ‘GoEmotions’

The GoEmotions dataset is a human-annotated collection of 58k Reddit comments labelled with 27 emotion categories

Google AI has released GoEmotions, a human-annotated dataset of 58,000 Reddit comments extracted from popular English-language subreddits and labelled with 27 emotion categories: 12 positive, 11 negative, and 4 ambiguous, plus 1 “neutral” category. The tech giant designed the GoEmotions taxonomy with both psychology and data applicability in mind.

The earlier datasets used for emotional analysis were too small and used just six basic emotions — anger, surprise, disgust, joy, fear, and sadness. 

The GoEmotions taxonomy aims to:

  • provide the greatest coverage of the emotions expressed in Reddit data
  • provide the best coverage of types of emotional expressions
  • limit the overall number of emotions and their overlap

Reddit was chosen as the source for this dataset because it offers large volumes of publicly available content with direct user-to-user conversation.

What is GoEmotions exactly?

Building the dataset involved the following steps:

  • Selecting and curating Reddit comments – The dataset draws on a Reddit data dump originating in the Reddit-data-tools project. Google used Reddit comments from 2005 to January 2019, sourced from subreddits with at least 10,000 comments. Deleted and non-English comments were excluded.

Reddit comes with its own problems, however. It has a known demographic bias towards young male users, which does not reflect a globally diverse population. It also skews towards toxic, offensive language. To address this, the researchers identified harmful comments using predefined lists of terms for offensive, adult and vulgar content, and for identity and religion. These lists were used for data filtering and masking.

To reduce profanity, the researchers removed subreddits that were not safe for work, as well as those where 10% or more of comments contained offensive and vulgar tokens. Vulgar comments were preserved because they help in learning about negative emotions. The researchers also reviewed identity comments and removed those that were offensive towards a particular ethnicity, gender, sexual orientation, or disability.
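The subreddit-level thresholds described above can be sketched as follows. This is only an illustration, not the researchers' code: the function name `keep_subreddits`, the `flagged_terms` set, and whitespace tokenisation are assumptions made for the example, and the removal of not-safe-for-work subreddits is not modelled.

```python
from collections import defaultdict


def keep_subreddits(comments, flagged_terms, min_comments=10_000,
                    max_flagged_share=0.10):
    """Return the subreddits that pass both thresholds.

    comments: iterable of (subreddit, text) pairs.
    flagged_terms: hypothetical set of lowercase offensive/vulgar tokens.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for subreddit, text in comments:
        totals[subreddit] += 1
        # Simplification: split on whitespace instead of a real tokenizer.
        tokens = set(text.lower().split())
        if tokens & flagged_terms:
            flagged[subreddit] += 1
    # Keep subreddits that are large enough and whose share of flagged
    # comments stays below the threshold.
    return {
        name for name, count in totals.items()
        if count >= min_comments
        and flagged[name] / count < max_flagged_share
    }
```

The defaults mirror the figures quoted in the article (10,000 comments and 10%); smaller values are convenient when trying the function on toy data.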

  • Filtering by length – The researchers applied the Natural Language Toolkit (NLTK) word tokenizer and selected comments 3-30 tokens long, including punctuation. They performed downsampling, capped by the number of comments with the median token count. They then ran an emotion prediction model trained on a pilot batch of 2,200 annotated examples, and excluded subreddits consisting of more than 30% neutral comments or less than 20% negative, positive, or ambiguous comments.

They assigned an emotion to each comment using the pilot model described above. They then reduced emotion bias by downsampling the weakly-labelled data, capping each emotion at the number of comments belonging to the median emotion count. To avoid over-representation of popular subreddits, they downsampled again, capped at the median subreddit count. From the remaining 315,000 comments, drawn from 482 subreddits, they randomly sampled comments for annotation.
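The length filter and the median-capped downsampling can be illustrated with a short sketch. Several things here are assumptions rather than the researchers' implementation: a regular expression stands in for NLTK's word tokenizer, the helper names are hypothetical, and the cap is taken to be the median group size, which is one reading of the description above.

```python
import random
import re
from collections import defaultdict
from statistics import median

# Stand-in tokenizer (assumption): words and punctuation marks are tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def length_filter(comments, min_tokens=3, max_tokens=30):
    """Keep comments whose token count, punctuation included, is in range."""
    return [c for c in comments
            if min_tokens <= len(tokenize(c)) <= max_tokens]


def cap_by_median_group_size(items, key, seed=0):
    """Downsample so that no group exceeds the median group size.

    key maps an item to its group, e.g. token count, predicted emotion,
    or subreddit.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    cap = int(median(len(members) for members in groups.values()))
    rng = random.Random(seed)
    sampled = []
    for members in groups.values():
        if len(members) > cap:
            members = rng.sample(members, cap)
        sampled.extend(members)
    return sampled
```

The same helper covers all three balancing passes by changing `key`: for example `key=lambda c: len(tokenize(c))` for comment length, or a function returning the subreddit name for the subreddit pass.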

  • Masking – They masked proper names referring to people with a [NAME] token, using a named entity tagger based on BERT (Bidirectional Encoder Representations from Transformers), and masked religion terms with a [RELIGION] token. The raters viewed unmasked comments during rating.
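To show what the masked output looks like, here is a minimal sketch that swaps in plain lookup lists where the researchers used a BERT-based tagger. The function `mask_terms` and both term lists are hypothetical; a real named entity tagger would find names from context rather than from a fixed list.

```python
import re


def _pattern(terms):
    # Longest terms first so multi-word entries win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


def mask_terms(text, names, religion_terms):
    """Replace listed names with [NAME] and religion terms with [RELIGION]."""
    if names:
        text = _pattern(names).sub("[NAME]", text)
    if religion_terms:
        text = _pattern(religion_terms).sub("[RELIGION]", text)
    return text


print(mask_terms("Thanks Alice, the Buddhist temple was lovely",
                 names={"Alice"}, religion_terms={"Buddhist"}))
# prints: Thanks [NAME], the [RELIGION] temple was lovely
```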

Principal Preserved Component Analysis

The researchers applied Principal Preserved Component Analysis (PPCA) to the data. PPCA examines the cross-covariance between datasets rather than the variance-covariance matrix within a single dataset.
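A minimal sketch of the idea in NumPy, assuming that `x` and `y` hold ratings of the same items from two separate groups of raters (one row per item, one column per emotion). Symmetrising the cross-covariance before the eigendecomposition is an assumption of this sketch, and the function name `ppca` is hypothetical; it should not be read as the researchers' exact procedure.

```python
import numpy as np


def ppca(x, y):
    """Eigendecompose the symmetrised cross-covariance between x and y.

    x, y: arrays of shape (n_items, n_features) describing the same items.
    Returns eigenvalues in descending order and the matching eigenvectors
    as columns.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    n_items = x.shape[0]
    # Cross-covariance between the two datasets, not within one of them.
    cross = xc.T @ yc / (n_items - 1)
    # Symmetrise so the eigenvalues are real (assumption of this sketch).
    sym = (cross + cross.T) / 2
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```

Because noise that is independent between the two rater groups contributes little to the cross-covariance, components with large eigenvalues reflect structure on which both groups agree.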

Results

Each component is significant, with p-values < 1.5e-6 for all dimensions. This shows that each emotion captures a unique part of the data. Although the taxonomy has no predefined notion of sentiment, emotions that are related in terms of their sentiment cluster together. Similarly, emotions that are related in terms of their intensity, such as sadness and grief, or annoyance and anger, are also closely correlated.

Sreejani Bhattacharyya

I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com