Bot-Heavy Data Labelling Platforms Are Causing Bad Results In AI Research

High-quality training data is a critical asset for artificial intelligence applications and products. But data labelling service provider Amazon Mechanical Turk has come under scrutiny in the social sciences research community for returning poor-quality results. Until now, Amazon Mechanical Turk has been one of the most popular ways of getting training data labelled for machine learning algorithms. A recent report by a news portal indicated that researchers who use the platform for academic studies are noticing a rise in low-quality responses to their survey questions. The suspected sources of these bad results are bots, humans assisted by bots, or even inattentive humans. Researchers also fear that bots are replacing human Turkers, making the platform less reliable for the types of research they are conducting.


Amazon launched Mechanical Turk in 2005 as a crowdsourcing tool that paid people small amounts of money for performing tasks from their desktops. Mechanical Turk, or MTurk, is essentially a crowdsourcing marketplace where a Requester publishes and coordinates a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcription. Workers choose from these tasks and earn a small amount of money for each one they complete.

Since its launch, the platform’s popularity has soared; workers have been known to label a 1,000-record dataset for $300 (plus fees) in a few hours. According to one report, a recent Mechanical Turk listing offered workers 80 cents to read a restaurant review and then answer a survey about their impressions of it, with a time limit of 45 minutes.

Of late, many researchers have used Mechanical Turk as a source of human-labelled training data, and poor-quality labels are now degrading their models. High-quality training data is a major requirement for deep-learning-based approaches to AI, and as the stakes rise for researchers and consumer AI applications roll out rapidly, the reliance on quality training data has only increased. Researchers say the problems of a bot-heavy Mechanical Turk are becoming easy to spot: the platform returns nonsensical responses or labels, and these erroneous answers skew the accuracy of the resulting models.
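To make the bot-spotting idea concrete, here is a minimal sketch (not MTurk's API; the data structure and field names are hypothetical) of two signals researchers commonly check for: verbatim duplicate free-text answers across workers, and implausibly fast completion times.

```python
def flag_suspicious(responses, min_seconds=10):
    """Return the set of response IDs that look bot-like.

    `responses` is a list of dicts with hypothetical keys:
    'id', 'free_text', and 'seconds_taken'.
    """
    seen_text = {}
    flagged = set()
    for r in responses:
        text = r["free_text"].strip().lower()
        # Verbatim duplicate free-text answers across workers are a red flag.
        if text in seen_text:
            flagged.add(r["id"])
            flagged.add(seen_text[text])
        else:
            seen_text[text] = r["id"]
        # Finishing a long survey in a few seconds is another.
        if r["seconds_taken"] < min_seconds:
            flagged.add(r["id"])
    return flagged

responses = [
    {"id": "A", "free_text": "Good restaurant", "seconds_taken": 120},
    {"id": "B", "free_text": "good restaurant", "seconds_taken": 4},
    {"id": "C", "free_text": "The food was bland", "seconds_taken": 90},
]
print(sorted(flag_suspicious(responses)))  # ['A', 'B']
```

Real quality checks are richer (attention-check questions, IP and geolocation filters, response-entropy tests), but even these two heuristics catch the crudest bot behaviour.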

Is MTurk Turning Into A Menace For AI Researchers?

With the need for training data going up, organisations and researchers are increasingly relying on third-party platforms and startups that provide training data as a service. For example, Seattle-based startup Mighty AI provides training data to companies that build computer vision models for autonomous vehicles. These as-a-service providers do a good job of labelling sensitive data, but they should still be vetted carefully.

Mighty AI and another startup, Figure Eight, transform messy real-world data into training data and are known to have stricter vetting processes. Training-data-as-a-service providers also give companies and startups a platform to automate data labelling. Meanwhile, the rise of startups that provide synthetic data is making it a go-to approach to the data-labelling problem, with researchers relying on both human-labelled and synthetic data. Synthetic or simulated datasets are also secure and do not risk exposing sensitive information.

Synthetic Data Generation: The Answer To Data Inaccuracies?

In an earlier report, we noted how Sergey Nikolenko, chief research officer at Neuromation, emphasised that synthetic data is a more efficient way of getting perfectly labelled data for recognition tasks. He shared that the synthetic data approach has proven very successful, and models trained by Neuromation are already being deployed in the retail sector. Synthetic data generation greatly reduces the manual work required to label data: because the data is generated rather than annotated, its labels are error-free by construction, and it is a useful tool for testing the scalability of algorithms and the performance of new software. Synthetic data platforms can perform a range of low-level tasks and cut down on the need for human labour. Increasingly, synthetic datasets are becoming part of the data strategy and have also sparked the notion of an open data economy.
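A toy illustration (not Neuromation's actual pipeline) of why synthetic data comes with perfect labels: the label is assigned at generation time, so there is no separate annotation step in which a worker, or a bot, could introduce errors. Here each 2-D point is sampled from one of two known distributions, and its class is known by construction.

```python
import random

def make_synthetic_dataset(n_per_class=100, seed=0):
    """Generate labelled 2-D points from two known distributions.

    Class 0 clusters around (0, 0); class 1 clusters around (5, 5).
    Because labels are assigned at generation time, every label is correct.
    """
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            data.append((point, label))
    return data

dataset = make_synthetic_dataset()
print(len(dataset))                    # 200 examples in total
print(sum(lbl for _, lbl in dataset))  # exactly 100 belong to class 1
```

Real synthetic-data platforms generate far richer artefacts (rendered images with pixel-perfect segmentation masks, for instance), but the principle is the same: the generator knows the ground truth, so labelling is free and exact.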


Richa Bhatia
Richa Bhatia is a seasoned journalist with six years' experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.
