It is difficult to completely separate a model's data collection and labelling from unconscious biases, data access limitations, and privacy concerns. AI models are created and trained by humans, so they are bound to mimic human biases and prejudices. As a result, datasets often encode unfair social biases along dimensions of race, gender, age, and more. When these biased AI models are used to solve critical societal problems, they produce skewed decisions.
IBM research has found more than 180 human biases in today’s AI systems. Google itself faced backlash when Google Photos classified Black people as gorillas because its facial recognition failed to recognise people of colour. This June, Facebook researchers described a framework for identifying gender biases in text to understand “social construct and identify languages.”
Dataset examination tools help analyse the representation of different groups, a crucial component in building an ethical ML model. For Google AI, this is a critical step in ensuring the responsible use of ML datasets and pointing toward potential mitigations of unfair outcomes.
Google recently introduced its dataset exploration tool, Know Your Data (KYD). It helps researchers, engineers, product teams, and policy teams explore datasets, improve data quality, and mitigate bias issues.
Know Your Data
“KYD is a dataset analysis tool that complements the growing suite of responsible AI tools being developed across Google and the broader research community,” Google AI’s blog post stated. “Currently, KYD only supports analysis of a small set of image datasets, but we’re working hard to make the tool accessible beyond this set.”
KYD aims to answer the following questions:
- Is my data corrupted? (e.g. broken images, garbled text, wrong labels, etc.)
- Is my data sensitive? (e.g. are there humans, explicit content)
- Does my data have gaps? (e.g. lack of daylight photos)
- Is my dataset balanced across various attributes?
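The first question, data corruption, is the easiest to check mechanically. A minimal sketch of such a check, using file magic bytes rather than KYD itself (the directory path and format list are illustrative assumptions):

```python
from pathlib import Path

# Magic-byte signatures for a few common image formats (illustrative subset).
SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a",              # GIF
    b"GIF89a",              # GIF
)

def looks_like_image(data: bytes) -> bool:
    """Cheap corruption check: does the byte stream start with a known signature?"""
    return any(data.startswith(sig) for sig in SIGNATURES)

def find_suspect_files(image_dir: str):
    """Return paths whose contents do not start with a known image signature."""
    return [
        path
        for path in Path(image_dir).iterdir()
        if path.is_file() and not looks_like_image(path.read_bytes())
    ]
```

A real pipeline would also attempt to decode each file fully, since a valid header does not guarantee an intact image.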
KYD’s features include filtering, grouping, and studying correlations. The tool also uses Google’s Cloud Vision API to automatically compute labels, surfacing signals that were not present in the original dataset; users can therefore explore the dataset along dimensions it did not originally contain.
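The filtering and grouping operations described above amount to selecting records by a signal value and counting records per signal. A toy sketch of these two primitives (the records and field names are hypothetical, not KYD's internal representation):

```python
from collections import Counter

# Hypothetical caption records; KYD operates on image datasets with richer signals.
records = [
    {"caption": "a man surfing a wave", "source": "web"},
    {"caption": "a woman cooking dinner", "source": "web"},
    {"caption": "a man skateboarding downhill", "source": "upload"},
]

def filter_by_word(records, word):
    """Filter: keep records whose caption contains the given word as a token."""
    return [r for r in records if word in r["caption"].split()]

def group_counts(records, key):
    """Group: count records by the value of an attribute ('signal')."""
    return Counter(r[key] for r in records)

print(filter_by_word(records, "man"))   # two records mention 'man' as a token
print(group_counts(records, "source"))
```

Auto-computed labels (e.g. from the Cloud Vision API) would simply add extra keys to each record, after which the same filter and group operations apply.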
COCO Captions Case Study
The COCO Captions dataset contains five human-generated, free-form text captions for each of over 300k images. Researchers at Google applied KYD to this dataset to explore gender biases by analysing gendered correlations within the image captions. The tool found gender biases across various depictions and descriptions in the dataset.
KYD’s Relations tab allowed the researchers to examine the difference between activities captioned with ‘man’ or ‘woman’ by visualising the extent to which two signals co-occur more (or less) than would be expected by chance. Cells are coloured blue or orange to signify a positive or negative correlation between two signal values.
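The "more (or less) than expected by chance" comparison can be sketched as a lift score: observed co-occurrences divided by the count expected if the two signals were independent. This is an illustrative reconstruction of the idea, not KYD's actual scoring code:

```python
def cooccurrence_lift(rows, signal_a, signal_b):
    """
    Ratio of observed co-occurrence of two boolean signals to the count
    expected by chance under independence. Values > 1 mean the signals
    co-occur more than chance (blue cells); < 1, less (orange cells).
    """
    n = len(rows)
    count_a = sum(1 for r in rows if r[signal_a])
    count_b = sum(1 for r in rows if r[signal_b])
    count_ab = sum(1 for r in rows if r[signal_a] and r[signal_b])
    expected = count_a * count_b / n
    return count_ab / expected if expected else float("nan")

# Toy example: 4 images, 2 captioned with 'woman', 1 mentioning 'cooking'.
rows = [
    {"woman": True, "cooking": True},
    {"woman": True, "cooking": False},
    {"woman": False, "cooking": False},
    {"woman": False, "cooking": False},
]
print(cooccurrence_lift(rows, "woman", "cooking"))  # 2.0: twice the chance rate
```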
A screenshot from KYD showing results for activities and the gendered captions.
Filtering the rows of the Relations tab further probed caption words ending in ‘ing’, such as ‘shopping’. The tool found strongly gendered correlations: activities stereotypically associated with women, such as shopping or cooking, occurred more often in images captioned with ‘woman’ than with ‘man’, and vice versa for activities like ‘surfing’ or ‘skateboarding’.
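A toy version of this filter, which tallies ‘-ing’ words separately for captions mentioning ‘man’ or ‘woman’ (the word heuristics and sample captions are illustrative, not KYD's implementation):

```python
from collections import Counter

def gendered_activity_counts(captions):
    """Count '-ing' words separately for captions mentioning 'man' vs 'woman'."""
    counts = {"man": Counter(), "woman": Counter()}
    for caption in captions:
        tokens = caption.lower().split()
        activities = [t for t in tokens if t.endswith("ing")]
        for gender in counts:
            if gender in tokens:  # exact-token match, so 'woman' never matches 'man'
                counts[gender].update(activities)
    return counts

captions = [
    "a woman cooking dinner",
    "a man surfing a wave",
    "a woman shopping for clothes",
    "a man cooking breakfast",
]
print(gendered_activity_counts(captions))
```

A production analysis would use proper tokenisation and a curated activity vocabulary; the crude ‘ends in -ing’ test is only a stand-in for the filter described above.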
According to Google AI, the value of the tool lies in the fact that these image captions are not individually derogatory or stereotypical; still, KYD finds where certain groups are over-represented within an activity across the dataset. By quickly surfacing these risks, it helps prevent models trained on the dataset from learning stereotypical associations.
Google further used the tool to find an age bias in the COCO dataset, where physical activities were rarely captioned with words like ‘elderly’ or ‘old’. The researchers also found that, relative to the caption ‘young’, ‘old’ was used more often to describe belongings or clothing than people.
A screenshot from KYD showing results for physical activities and age specifications.
KYD helped surface this under-representation, presumably the result of the dataset containing fewer images of older people, an aspect of the dataset that needs to be addressed.
This tool is a step toward improving datasets and training ML models on fair, unbiased data. The researchers describe it as user-friendly, with features for filtering, grouping, and comparing attributes in real time.