
How to Automate Data Labelling with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labelling service for building high-quality ground truth datasets to train machine learning models.

AWS (Amazon Web Services) is the most popular and widely used cloud service provider. In 2017, AWS released Amazon SageMaker, its fully managed machine learning platform in the cloud, which lets developers create, train and deploy their models quickly. In 2018, it followed up with Amazon SageMaker Ground Truth, which manages the data labelling process needed to produce training datasets for those models.

Ground Truth can use Amazon Mechanical Turk (a crowdsourcing platform), an internal data labelling team, or external third-party vendors to get the labelling job done. Workflows can be custom-built or chosen from the built-in options. The labelled dataset that Ground Truth produces can be used to train your own models or as a training dataset for an Amazon SageMaker model.

SageMaker Ground Truth offers labelling services for image, audio, video, and text data, with features such as removal of distortion in images, automatic 3D cuboid snapping, and auto-segment tools to reduce labelling time. Automated labelling is also possible through active learning: Ground Truth trains a model on human-labelled data, uses it to label the items it is confident about, and sends the rest to human labellers.
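The split between automatic and human labelling can be pictured as confidence-based routing. The sketch below is a simplified illustration, not Ground Truth's internal algorithm: the `route_by_confidence` function, the 0.9 threshold, and the prediction format are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical prediction record: (item_id, predicted_label, confidence in [0, 1]).
Prediction = Tuple[str, str, float]


def route_by_confidence(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[Dict[str, str], List[str]]:
    """Split items into auto-labelled ones and ones sent to human labellers."""
    auto_labelled: Dict[str, str] = {}
    needs_human: List[str] = []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            # The model is confident enough: accept its label.
            auto_labelled[item_id] = label
        else:
            # Low confidence: queue the item for a human labeller.
            needs_human.append(item_id)
    return auto_labelled, needs_human


if __name__ == "__main__":
    preds = [("img-1", "dog", 0.97), ("img-2", "cat", 0.55), ("img-3", "dog", 0.91)]
    auto, manual = route_by_confidence(preds)
    print(auto)    # {'img-1': 'dog', 'img-3': 'dog'}
    print(manual)  # ['img-2']
```

The human labels collected for the low-confidence items would then feed the next round of model training, which is what lets the automatically labelled share grow over time.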

Pricing varies per labelled object (an image or video frame, an audio recording, a section of text, etc.) depending on whether it is labelled automatically by Ground Truth or by a human labeller. If you use a vendor or Mechanical Turk to provide labels, you pay an additional cost per labelled object. If you use your own employees for labelling, there is no additional cost per labelled object. The workforce can be public or private.

Custom Workflow 

You can create your own data labelling workflow in Ground Truth. A custom workflow consists of three components:

  • A UI template that gives workers the instructions and tools needed to perform the labelling task. A large selection of templates is available, and users can also upload their own JavaScript/HTML template.
  • An AWS Lambda function that encapsulates the pre-processing logic: it serves the unlabelled data and adds any additional context for the labeller.
  • An AWS Lambda function that encapsulates the post-processing logic, typically used to insert an accuracy improvement algorithm.

The accuracy improvement algorithm assesses the quality of annotations made by humans and can determine what is "right" and what is "wrong" when the same data is given to multiple human labellers.
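The two Lambda steps can be sketched in Python as follows. The event and response keys in `pre_annotation_handler` (`dataObject`, `source-ref`, `source`, `taskInput`, `isHumanAnnotationRequired`) follow the pre-annotation Lambda contract as AWS documents it, while `taskObject` and `instructions` are hypothetical names that would have to match the variables in your own UI template. `majority_vote` is a deliberately simple stand-in for the accuracy improvement step, not the consolidation algorithm Ground Truth itself uses.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def pre_annotation_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Pre-processing step: turn one manifest entry into input for the task UI."""
    data_object = event["dataObject"]
    # A manifest entry carries either an S3 URI ("source-ref") or inline data ("source").
    source = data_object.get("source-ref") or data_object.get("source")
    return {
        "taskInput": {
            # Hypothetical keys: they must match the names used in your template.
            "taskObject": source,
            "instructions": "Draw a box around every dog in the image.",
        },
        # AWS's examples pass this flag as a string rather than a boolean.
        "isHumanAnnotationRequired": "true",
    }


def majority_vote(labels: List[str]) -> Tuple[str, float]:
    """Post-processing stand-in: most common label and its share of the votes."""
    if not labels:
        raise ValueError("at least one label is required")
    # On a tie, Counter returns the label that was seen first.
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


if __name__ == "__main__":
    event = {"dataObject": {"source-ref": "s3://example-bucket/dog.jpg"}}
    print(pre_annotation_handler(event, None)["taskInput"]["taskObject"])
    print(majority_vote(["dog", "dog", "cat"])[0])  # dog
```

The vote share returned by `majority_vote` can serve as a rough confidence score: items where labellers disagree heavily are good candidates for another round of review.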

Amazon SageMaker Ground Truth also lets workers verify that existing labels are correct or adjust them where needed. These jobs fall into two categories:

  • Label verification – Workers indicate whether the existing labels are correct, or rate the label quality, and if necessary add comments to explain their reasoning.
  • Label adjustment – Workers adjust prior annotations to correct them.

Datasets are stored in Amazon Simple Storage Service (S3) buckets. A bucket contains three things: the unlabelled data, an input manifest file that lists the data files to read, and an output manifest file containing the results of the labelling job.
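An input manifest is a JSON Lines file in which each line describes one data object, usually through a `source-ref` key holding the object's S3 URI. The sketch below writes such a file; the bucket name, file names, and the helper names `build_manifest_lines` and `write_manifest` are placeholders chosen for this example.

```python
import json
from typing import Iterable, List


def build_manifest_lines(s3_uris: Iterable[str]) -> List[str]:
    """Return one JSON object per data file, as required by the JSON Lines format."""
    return [json.dumps({"source-ref": uri}) for uri in s3_uris]


def write_manifest(s3_uris: Iterable[str], path: str) -> None:
    """Write the manifest to a local file, ready to be uploaded to S3."""
    with open(path, "w", encoding="utf-8") as manifest:
        for line in build_manifest_lines(s3_uris):
            manifest.write(line + "\n")


if __name__ == "__main__":
    # Placeholder URIs; replace with the objects in your own bucket.
    uris = [
        "s3://example-bucket/images/dog1.jpg",
        "s3://example-bucket/images/dog2.jpg",
    ]
    write_manifest(uris, "input.manifest")
    print(build_manifest_lines(uris)[0])
    # {"source-ref": "s3://example-bucket/images/dog1.jpg"}
```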

Image

Ground Truth supports image classification, object detection, and semantic segmentation for computer vision use cases, such as training models for autonomous vehicles to detect real-world objects like other vehicles, pedestrians, traffic lights, and signals.

Video

The built-in GUI supports video multi-frame object classification, video multi-frame object tracking, and video clip classification. The volume of work adds up quickly: at 30 frames per second, one minute of video translates to 1,800 individual images.
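The frame arithmetic is simply duration multiplied by frame rate, which makes it easy to estimate the size of a video labelling job up front. The helper name `frame_count` is an assumption made for this sketch.

```python
def frame_count(duration_seconds: float, frames_per_second: float) -> int:
    """Number of individual frames in a clip of the given length."""
    return round(duration_seconds * frames_per_second)


if __name__ == "__main__":
    print(frame_count(60, 30))   # 1800 frames in one minute at 30 fps
    print(frame_count(600, 30))  # 18000 frames in ten minutes at 30 fps
```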

3D point cloud

3D point clouds are captured using LiDAR to generate a 3D understanding of a physical space at a single point in time. Ground Truth supports 3D point cloud labelling tasks including object detection, object tracking, and semantic segmentation.

Text

Categorizing text into different labels is often used for natural language processing (NLP) models that identify things like topics, product descriptions, movie reviews or sentiment. The text tasks covered here are:

  • Text classification
  • Entity extraction

Download Annotated Dataset

To download the annotated dataset, you can download individual files from the S3 bucket. To download the entire annotations folder at once, install the AWS CLI:

pip install awscli

Then run:

aws s3 sync s3://<source_bucket> <local_destination>
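The same download can be scripted from Python. The sketch below assumes the boto3 library and lists every object under a prefix before downloading it; the function name `download_prefix` and the `<prefix>` placeholder are introduced for this example only. Unlike `aws s3 sync`, it does not skip files that already exist locally.

```python
import os
from typing import Any, List


def download_prefix(s3_client: Any, bucket: str, prefix: str, local_dir: str) -> List[str]:
    """Download every object under `prefix` into `local_dir`, keeping the key layout."""
    downloaded: List[str] = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder keys
            local_path = os.path.join(local_dir, *key.split("/"))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3_client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded


# Usage (requires `pip install boto3` and configured AWS credentials):
#   import boto3
#   download_prefix(boto3.client("s3"), "<source_bucket>", "<prefix>", "<local_destination>")
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids creating a new connection on every call.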

Example annotation output for a bounding box drawn around a dog in an image:

[
  {
    "boundingBox": {
      "boundingBoxes": [
        {
          "label": "Dog",
          "height": 840,
          "width": 756,
          "top": 20,
          "left": 55
        }
      ],
      "inputImageProperties": {
        "height": 512,
        "width": 926
      }
    }
  }
]
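Many training pipelines expect corner coordinates rather than a top/left/width/height box, so a small conversion step is often needed after download. The sketch below assumes the key names from the annotation format shown here; the helper names `to_corners` and `fits_in_image` are illustrative.

```python
from typing import Dict, Tuple


def to_corners(box: Dict[str, int]) -> Tuple[int, int, int, int]:
    """Convert a top/left/width/height box to (x_min, y_min, x_max, y_max)."""
    x_min, y_min = box["left"], box["top"]
    return x_min, y_min, x_min + box["width"], y_min + box["height"]


def fits_in_image(box: Dict[str, int], image: Dict[str, int]) -> bool:
    """Check that the box lies entirely inside the image dimensions."""
    x_min, y_min, x_max, y_max = to_corners(box)
    return (
        x_min >= 0
        and y_min >= 0
        and x_max <= image["width"]
        and y_max <= image["height"]
    )


if __name__ == "__main__":
    box = {"height": 840, "width": 756, "top": 20, "left": 55}
    image = {"height": 512, "width": 926}
    print(to_corners(box))            # (55, 20, 811, 860)
    print(fits_in_image(box, image))  # False: the box is taller than the image
```

A check like `fits_in_image` is a cheap way to flag annotations for review before they reach training, as it does for the sample values used here.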

Customers

NFL (sports), Airbnb (hospitality), PrecisionHawk (drone technology), AstraZeneca (pharmaceuticals), T-Mobile (wireless service provider), Pinterest (web and mobile application for discovering information on the web), Change Healthcare (healthcare technology company), GumGum (AI company specializing in computer vision solutions), Automagi (AI/ML bot SaaS service provider), ZipRecruiter (job posting services).


Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.