For AI-based software to work well in the real world, a large amount of high-quality data is required to train the system. The biggest problem with data preprocessing is labelling, which is both a time- and resource-intensive process. A study found that data scientists spend, on average, about 80 percent of their time preprocessing data and only 20 percent actually building machine learning models.
A possible solution to this problem is to outsource or crowdsource the labelling work. Crowdsourcing data labelling is now the norm, as it is cheap and lets data scientists focus their efforts on more skill-intensive tasks.
Various platforms, such as Amazon Mechanical Turk, Lionbridge AI, and Clickworker, offer on-demand data labelling services.
Crowdsourcing data labelling
Data science teams prefer to outsource data labelling over carrying out the same task in-house. This offers the following benefits:
- Eliminates the need to hire thousands of temporary employees.
- Reduces data scientists’ workload.
- Eliminates the investment in annotation tools that in-house data labelling requires (subject to relative costs).
Most crowdsourcing platforms assign freelancers from around the world to annotate data. At the most basic level, a crowdsourcing platform breaks the project down into smaller tasks, which are then distributed among multiple freelancers.
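The task breakdown described above can be pictured with a short Python sketch. It is only an illustration and not how any particular platform works: the function names, the worker names, the task size, and the `redundancy` parameter (how many freelancers receive the same task) are all assumptions made for the example.

```python
def split_into_tasks(items, task_size):
    """Break a list of items into consecutive tasks of at most task_size items."""
    return [items[i:i + task_size] for i in range(0, len(items), task_size)]


def assign_tasks(tasks, workers, redundancy=1):
    """Give each task to `redundancy` distinct workers, cycling through the pool.

    Returns a dict mapping each worker to the list of task indices they received.
    """
    if redundancy > len(workers):
        raise ValueError("redundancy cannot exceed the number of workers")
    assignments = {worker: [] for worker in workers}
    start = 0
    for task_id in range(len(tasks)):
        # Consecutive positions in the pool are distinct while redundancy <= pool size.
        for offset in range(redundancy):
            worker = workers[(start + offset) % len(workers)]
            assignments[worker].append(task_id)
        start = (start + redundancy) % len(workers)
    return assignments


# Hypothetical project: ten images, four per task, each task seen by two workers.
items = [f"image_{n:03d}.jpg" for n in range(10)]
tasks = split_into_tasks(items, task_size=4)
assignments = assign_tasks(tasks, ["worker_a", "worker_b", "worker_c"], redundancy=2)
```

Sending the same task to more than one freelancer costs more, but it gives the assignee overlapping labels that can later be compared for quality.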
Since data labelling is a low-skill job, basic qualifications are enough to become a data labeller. An aptitude for learning new tools and working with technology is another desirable trait. In addition, specific data labelling jobs may require the candidate to have particular skills, such as language translation.
Data labellers are often given project-specific training by the crowdsourcing platforms they work for. These platforms may also provide tools and resources that help labellers learn more about their products and be more productive. These resources include code samples, libraries, guides, tutorials, technical documentation, and notes.
Interestingly, countries such as India are becoming hotspots for outsourcing data labelling services. In an earlier interview with Analytics India Magazine, Shishir Thakur, CEO and founder, Cranberry Tech, said, “India has emerged as a huge pool of employable workers to undertake data labelling jobs. The reasons are essentially the same which led to the expansion of the BPO/KPO service industry in India in the past 20 years:
- Cost-effective workforce
- English literacy and basic computing skills
- High speed and cheap internet
- Stable economy – compared to some other East-European/African/South-Asian countries”
Choosing a data labelling platform
Before choosing a data labelling team, one must always take note of the tools the platform employs for data labelling. These tools must suit the particular use case and data.
In addition, the selected platform must include a management system to manage data, projects, and users. It should allow the assignee to communicate with labellers about the work and about mislabelled data, and to implement labelling workflows.
A model is only as good as the data fed to it during training. Hence, the chosen platform must have a quality control process that lets the manager control the quality of labelled data. The labelling workforce should be trained, vetted, and managed by experts to ensure high-quality services.
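One common way to put such a quality control process into practice is to have several labellers annotate the same item and keep only labels with enough agreement. The Python sketch below illustrates the idea with a simple majority vote. The function name, the sample data, and the 0.6 agreement threshold are assumptions chosen for the example, not features of any specific platform.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.6):
    """Majority-vote each item; flag items whose top label has too little support.

    votes: dict mapping item id -> list of labels from different labellers.
    Returns (consensus, needs_review).
    """
    consensus = {}
    needs_review = []
    for item_id, labels in votes.items():
        if not labels:
            # Nothing to vote on, so a human has to look at it.
            needs_review.append(item_id)
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement >= min_agreement:
            consensus[item_id] = top_label
        else:
            needs_review.append(item_id)
    return consensus, needs_review


# Hypothetical votes from three labellers per image.
votes = {
    "img_1": ["cat", "cat", "cat"],
    "img_2": ["dog", "cat", "dog"],
    "img_3": ["cat", "dog", "bird"],
}
consensus, needs_review = aggregate_labels(votes)
```

Here `img_1` and `img_2` receive a consensus label, while `img_3`, on which no two labellers agree, is flagged for review by the manager.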
Crowdsourcing companies vary in the features they offer, their data security, their practices, and more. It is therefore imperative to evaluate a service provider before committing to it.
One should look at client logos, testimonials, and case studies to understand previous clients’ experience with the platform.
Data confidentiality and protection are important, especially for critical tasks. Before selecting a data labelling platform, one must clearly understand its security protocols and look at measures, including non-disclosure agreements, that prevent data theft and leaks. An ISO certification might go a long way in ensuring that sensitive data is in safe hands.
One can also ask for a pilot project before committing to a crowdsourcing partner. This helps confirm that the work the platform delivers matches the kind of work required.
When data labelling is done within one’s own organisation, employees can be vetted, trained, and actively managed to assure the quality of labelled data. However, as crowdsourced data labelling is done beyond the supervision of the assignee, it may be less reliable and of lower quality. Besides this, a security question will always loom if sensitive data is outsourced.
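One simple way to judge a pilot is to include a small set of items that have already been labelled in-house and compare the platform's labels against them. The Python sketch below shows this check. The data, the function name, and the 0.9 acceptance bar are hypothetical and should be adapted to the project.

```python
def pilot_accuracy(gold, pilot):
    """Compare pilot labels with trusted in-house ("gold") labels.

    Returns (accuracy, mismatches) over the items present in both sets.
    """
    shared = [item for item in gold if item in pilot]
    if not shared:
        raise ValueError("pilot and gold sets share no items")
    mismatches = [item for item in shared if gold[item] != pilot[item]]
    accuracy = (len(shared) - len(mismatches)) / len(shared)
    return accuracy, mismatches


# Hypothetical gold labels and the labels returned by the pilot.
gold = {"doc_1": "spam", "doc_2": "ham", "doc_3": "spam", "doc_4": "ham"}
pilot = {"doc_1": "spam", "doc_2": "spam", "doc_3": "spam", "doc_4": "ham"}

accuracy, mismatches = pilot_accuracy(gold, pilot)
meets_bar = accuracy >= 0.9  # assumed acceptance threshold
```

The list of mismatches is as useful as the score itself, since it shows whether errors come from unclear instructions or from careless labelling.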