Bot-Heavy Data Labelling Platforms Are The Cause Of Bad Results In AI Research

Richa Bhatia


High-quality training data is critical to the success of artificial intelligence applications and products. But data labelling service provider Amazon Mechanical Turk has now come under scrutiny in the social sciences research community for delivering bad results. So far, Amazon Mechanical Turk has been one of the most popular ways of getting training data for machine learning algorithms. A recent report by a noted news portal indicated that researchers who use Amazon Mechanical Turk for academic studies are noticing an increase in bad responses to survey questions. The bad results have been attributed to bots, human-augmented bots or even humans themselves. Researchers also fear that bots are replacing human Turkers in some way, making the platform less reliable for the types of research they’re conducting.

Amazon launched Mechanical Turk in 2005 as a crowdsourcing tool that pays people a small amount of money for performing tasks on their computers. Mechanical Turk, or MTurk, is essentially a crowdsourcing marketplace where a Requester publishes and coordinates a wide range of Human Intelligence Tasks (HITs), such as classification, tagging, surveys and transcription. Workers choose from these tasks and earn a small amount of money for each one they complete.



Since its launch, the platform’s popularity has soared, and a 1,000-record dataset can reportedly be labelled in a few hours for $300 (plus fees). According to a report, a recent Mechanical Turk listing offered workers 80 cents to read a restaurant review and then answer a survey about their impressions of it; the time limit was 45 minutes.
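To put the quoted figures in perspective, the short sketch below works out the cost per labelled record and the effective hourly pay for the survey task. The four constants come from the figures above; the function names are illustrative, and the hourly rate assumes a worker uses the full 45-minute limit.

```python
# Figures quoted in the article; the function names are illustrative.
DATASET_FEE_USD = 300.0      # fee for labelling the dataset (platform fees excluded)
DATASET_RECORDS = 1_000      # records in the dataset
TASK_REWARD_USD = 0.80       # reward for the restaurant-review survey
TASK_LIMIT_MINUTES = 45      # time limit for that task


def cost_per_record(total_fee: float, records: int) -> float:
    """Average amount paid per labelled record."""
    return total_fee / records


def hourly_rate(reward: float, minutes_spent: float) -> float:
    """Effective hourly pay if a task takes `minutes_spent` minutes."""
    return reward * 60.0 / minutes_spent


per_record = cost_per_record(DATASET_FEE_USD, DATASET_RECORDS)
worst_case_rate = hourly_rate(TASK_REWARD_USD, TASK_LIMIT_MINUTES)
print(f"Cost per record: ${per_record:.2f}")
print(f"Hourly rate at the full time limit: ${worst_case_rate:.2f}")
```

The time limit is only an upper bound, so a worker who finishes the survey faster earns proportionally more per hour.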

Of late, many researchers have used Mechanical Turk as a source of human-labelled training data, and poor labels are leading to bad output from their models. High-quality training data is a major requirement for deep-learning-based approaches to AI. As the stakes get higher for researchers and consumer-facing AI applications are rolled out rapidly, the reliance on quality training data has increased. Researchers say it is becoming easy to spot the problems of a bot-heavy Mechanical Turk platform, which returns nonsensical responses or labels. They believe these mislabelled or erroneous answers are skewing the accuracy of the resulting models.
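One widely used way to limit the damage from nonsensical labels is to collect several labels per item, check each worker against a few items whose answers are already known, and take a majority vote among the workers who pass. The sketch below illustrates that idea only; it is not a method described by the researchers in the report. The worker IDs, items, labels and the 0.5 accuracy threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical responses: (worker_id, item_id, label). Not real MTurk data.
RESPONSES = [
    ("w1", "img1", "cat"), ("w2", "img1", "cat"), ("w3", "img1", "dog"),
    ("w1", "img2", "dog"), ("w2", "img2", "dog"), ("w3", "img2", "cat"),
    ("w1", "img3", "cat"), ("w2", "img3", "cat"), ("w3", "img3", "dog"),
]

# Hypothetical gold-standard items whose true label the requester already knows.
GOLD = {"img1": "cat", "img2": "dog"}


def worker_accuracy(responses, gold):
    """Fraction of gold items each worker labelled correctly."""
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in responses:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: correct[w] / seen[w] for w in seen}


def majority_labels(responses, trusted):
    """Majority vote per item, counting only trusted workers."""
    votes = defaultdict(Counter)
    for worker, item, label in responses:
        if worker in trusted:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


accuracy = worker_accuracy(RESPONSES, GOLD)
trusted = {w for w, acc in accuracy.items() if acc >= 0.5}  # threshold is arbitrary
labels = majority_labels(RESPONSES, trusted)
print(accuracy)
print(labels)
```

In this toy data, worker `w3` gets every gold item wrong and is excluded, so the unlabelled item `img3` is decided by the two remaining workers. Real pipelines would need more gold items per worker and a rule for breaking ties.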

Is MTurk Turning Into A Menace For AI Researchers?

With the need for training data going up, organisations and researchers are increasingly relying on third-party platforms and startups that provide training data as a service. For example, Seattle-based startup Mighty AI provides training data to companies that build computer vision models for autonomous vehicles. These as-a-service providers are doing a good job of labelling sensitive data, but they should be vetted more responsibly.




Mighty AI and another startup, Figure Eight, transform messy real-world data into training data and are known to have stricter vetting processes. Training-data-as-a-service providers also give companies and startups a platform to automate data labelling. In addition, startups that provide synthetic data are on the rise, and synthetic data is fast becoming a go-to approach to the data-labelling problem, with researchers relying on both human-labelled training data and synthetic data. Synthetic, or simulated, datasets are also considered secure and do not risk leaking real data.

Synthetic Data Generation: The Answer To Data Inaccuracies

In an earlier report, we described how Sergey Nikolenko, chief research officer at Neuromation, emphasised that synthetic data is a more efficient way of getting perfectly labelled data for recognition. He shared that the synthetic data approach has proven very successful, and that models trained by Neuromation are already being implemented in the retail sector. Synthetic data generation greatly reduces the manual work required to label data. The generated data is labelled without errors, and it is believed to be a useful tool for testing the scalability of algorithms and the performance of new software. Synthetic data platforms can perform a range of low-level tasks and cut down on the need for human labour. Increasingly, synthetic datasets are becoming part of data strategy and have also sparked the notion of an open data economy.
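The reason synthetic data comes out labelled without errors is that the generating program decides the label before it creates the example, so no human annotation step is involved. The toy sketch below shows this with two clusters of 2-D points. It is not Neuromation's pipeline, which the report does not describe; the class names, cluster centres and function name are all hypothetical.

```python
import random


def make_synthetic_dataset(n_per_class, seed=0):
    """Generate 2-D points for two classes; each label is known by construction."""
    rng = random.Random(seed)
    # Hypothetical class centres, chosen only for illustration.
    centres = {"class_a": (0.0, 0.0), "class_b": (5.0, 5.0)}
    dataset = []
    for label, (cx, cy) in centres.items():
        for _ in range(n_per_class):
            # The label is fixed first, then a point is sampled around its centre.
            point = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            dataset.append((point, label))
    rng.shuffle(dataset)
    return dataset


data = make_synthetic_dataset(n_per_class=100)
print(len(data), data[0])
```

Because the generator is seeded, the same dataset can be regenerated at any size, which is one way such data can be used to test how an algorithm scales.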

 


Copyright Analytics India Magazine Pvt Ltd
