
Is AI fast becoming a technology built on worker exploitation from the Global South?

When sourcing datasets through crowd work platforms, it is essential to consider annotator subjectivity: it can push dataset quality to either extreme, which in turn affects the entire ML model.

Data labelling and annotation play a crucial role in ML development. Most ML and deep learning models are data-hungry, and manually labelling each image, text or audio clip in a large dataset is a tedious, labour-intensive process. Hence, many companies choose not to handle data labelling in-house and instead use specialised software, moving to modifiable open-source platforms when pre-built solutions do not meet their specific needs. Platforms like Amazon Mechanical Turk, Clickworker, and Lionbridge AI offer on-demand data labelling services. Yet companies often fail to focus on the ethical considerations around the processes and decisions that go into building ML datasets.

According to Google research, the data generated in crowd work tasks is shaped by many social factors, and these datasets continue to shape systems long after worker engagement ends. Google argues that this affects not just current models but also future models built from such data, and that understanding the perspectives embedded within datasets is essential to understanding the resulting models and their potential social impact.

Two major factors affect annotators and their work, and both can hinder the creation of an ethical dataset. The first is how annotators' personal experiences shape their annotations; the second is their work environment, i.e., their relationship with the crowdsourcing platform. These factors are crucial because annotators' individual perspectives and biases may get encoded within the dataset labels.

The socio-economic background of annotators can impact the dataset

The importance of data labelling has grown as deep learning techniques require large amounts of data to train their models. According to Grand View Research, the data collection and labelling market is expected to grow to USD 8.2 billion by 2028. Over 80% of the ML development process consists of data preparation tasks like collection, labelling and cleaning.

According to a paper by researchers from Cornell, Princeton, the University of Montreal, and the National Institute of Statistical Sciences, a great deal of data annotation and labelling work is done outside the United States and other Western countries, exploiting workers in regions where labour is cheap. Companies like Samasource, Mighty AI and Scale AI operate in the United States but crowdsource workers from around the world, primarily from sub-Saharan Africa and Southeast Asia. This creates a wide disparity between the profits earned by data labelling companies and what they pass on to their workers.

Little attention is paid to annotator positionality: the way workers' social identities shape their understanding of the world. Crowd workers are selected by task requesters based on quality metrics, not on any socially defining features. This is concerning because crowdsourced annotations are often used to build datasets that capture subjective phenomena such as hate speech and sentiment. There is value in acknowledging and accounting for workers' socio-cultural backgrounds, both for data quality and for social impact.

Google research suggests that treating annotators' lived experiences as expertise may also be of great utility in some cases. For example, women experience higher rates of sexual harassment online and are more likely to identify it, especially those who have experienced online abuse themselves. Similarly, incorporating antiracist activists' perspectives into hate speech annotations has yielded better-aligned models.

An important question to be answered in data collection is how much annotator subjectivity matters for the task at hand and how it affects the end result.
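One hedged way to start answering that question is to measure how much annotators actually disagree before committing to a labelling scheme. The sketch below is a minimal illustration, not a prescribed workflow: it assumes a hypothetical toy set of binary hate-speech labels from three annotators and uses scikit-learn's cohen_kappa_score to compare annotator pairs, then flags items where the pool splits.

```python
# Minimal sketch: quantify annotator disagreement on a toy labelling task.
# All labels below are hypothetical; in practice they would come from a
# platform's raw, per-annotator export rather than a pre-aggregated file.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Keys = annotators, values = labels per item; 1 = "hate speech", 0 = "not hate speech".
labels = {
    "annotator_a": [1, 0, 1, 1, 0, 1, 0, 0],
    "annotator_b": [1, 0, 0, 1, 0, 1, 1, 0],
    "annotator_c": [0, 0, 1, 1, 0, 0, 1, 0],
}

# Pairwise Cohen's kappa: values near 1 indicate strong agreement,
# values near 0 indicate agreement no better than chance.
for name_a, name_b in combinations(labels, 2):
    kappa = cohen_kappa_score(labels[name_a], labels[name_b])
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")

# Items with split votes are candidates for review, clearer guidelines,
# or deliberately keeping the labels disaggregated instead of majority-voting.
n_items = len(next(iter(labels.values())))
for i in range(n_items):
    votes = [annotations[i] for annotations in labels.values()]
    if len(set(votes)) > 1:
        print(f"item {i}: split vote {votes}")
```

For subjective tasks, low agreement is not automatically a data-quality failure; it can be a signal that the disagreement itself carries information worth preserving rather than averaging away.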

Worker experience plays a crucial role in the quality of the dataset

The workplace experiences of dataset annotators add another layer of considerations that can affect their work and, in turn, the AI model. These issues can relate to compensation, imbalances in the relationship between the worker and the requester, and even the structure of the annotation work itself.

Though data labelling is not physically intensive, workers have reported that the pace and volume of their tasks are “monotonous” and “mentally exhausting.” In the Global South, local companies like Fastagger in Kenya, Sebenz.ai in South Africa, and Supahands in Malaysia have begun to proliferate. As AI development scales, the expansion of these companies opens new doors for low-skilled workers but also creates opportunities for exploitation.

Research also suggests that the majority of crowd workers (94%) have had work rejected or gone unpaid, even though requesters retain full rights over the data they receive, a system that enables wage theft. Many countries currently have no regulations around crowd work, and minimum wage laws do not apply because crowd workers are classified as ‘independent contractors’ rather than ‘workers.’ The geographically distributed and anonymous nature of crowdsourced annotation and labelling work also imposes significant barriers to collective action.

Wrapping up

When choosing crowd work datasets, it is essential for companies to be intentional about whether to lean on annotators’ subjective judgements. Not accounting for task subjectivity can introduce inadvertent biases and miss critical insights on tasks that could benefit from annotators’ lived experiences. Clarifying such aspects of a task has ramifications for how well the dataset captures the relevant aspects of human intelligence.

When choosing an annotation platform, it is important to pick one that allows flexibility in designing custom annotator pools along various socio-demographic axes. These decisions should be guided by the communities that will be most affected by models built from the data, and by those that could be harmed most if they are not represented. To keep dataset quality high, it is also essential to compare minimum pay across platforms and to support those that uphold fair pay standards.
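As a rough illustration of what being intentional can look like in practice, here is a minimal sketch that encodes a hypothetical annotator-pool specification, including socio-demographic axes and a minimum hourly rate, and screens candidate platforms against it. The field names, platform names, and figures are invented for the example and do not correspond to any real platform’s data or API.

```python
# Hypothetical sketch: encode requirements for an annotator pool and screen
# candidate platforms against them. Every name and number here is an
# illustrative placeholder, not real platform data or a real platform API.
from dataclasses import dataclass


@dataclass
class PoolSpec:
    """Requirements a task requester settles on before collecting labels."""
    socio_demographic_axes: list[str]          # axes the pool should cover
    min_hourly_rate_usd: float                 # fair-pay floor for the task
    collect_annotator_background: bool = True  # keep consented metadata with labels


@dataclass
class Platform:
    name: str
    supports_custom_pools: bool
    reported_median_hourly_rate_usd: float


def meets_spec(platform: Platform, spec: PoolSpec) -> bool:
    """A platform qualifies only if it allows custom pools and clears the pay floor."""
    return (
        platform.supports_custom_pools
        and platform.reported_median_hourly_rate_usd >= spec.min_hourly_rate_usd
    )


spec = PoolSpec(
    socio_demographic_axes=["region", "gender", "first_language"],
    min_hourly_rate_usd=7.25,  # placeholder floor; set it against local standards
)

candidates = [
    Platform("platform_x", supports_custom_pools=True, reported_median_hourly_rate_usd=9.10),
    Platform("platform_y", supports_custom_pools=True, reported_median_hourly_rate_usd=3.40),
    Platform("platform_z", supports_custom_pools=False, reported_median_hourly_rate_usd=11.00),
]

for platform in candidates:
    verdict = "meets the spec" if meets_spec(platform, spec) else "does not meet the spec"
    print(f"{platform.name}: {verdict}")
```

The point of such a checklist is less the code than the habit: writing down the pool requirements and the pay floor before collection starts makes the trade-offs explicit instead of leaving them to platform defaults.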


Meeta Ramnani

Meeta’s interest lies in finding practical, real-world applications of technology. At AIM, she writes stories that question new inventions and the need to develop them. She believes that technology has changed, and will continue to change, the world very fast, and that it is no longer ‘cool’ to be ‘old-school’. People who don’t keep up with technology will surely be left behind.
