How to Automate Data Labelling with Amazon SageMaker Ground Truth


AWS (Amazon Web Services) is one of the most popular and widely used cloud service providers. In 2017, AWS released Amazon SageMaker, its fully managed machine learning platform in the cloud, which allows developers to build, train and deploy models quickly. In 2018, Amazon SageMaker Ground Truth was launched as a fully managed data labelling service for generating high-quality ground truth datasets for training machine learning models.

Ground Truth can use Amazon Mechanical Turk (a crowdsourcing platform), an internal data labelling team, or external third-party vendors to get the labelling job done. You can use the built-in workflows or customize your own. The labelled dataset that Ground Truth outputs can be used to train your own models or as a training dataset for an Amazon SageMaker model.

SageMaker Ground Truth supports labelling tasks for image, audio, video, and text data. Assistive features such as removal of distortion in images, automatic 3D cuboid snapping, and auto-segment tools reduce the labelling time. Automated labelling is possible through active learning: a model trained on the human-labelled data labels the items it is confident about, and the remaining items are sent to human labellers.
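The routing idea behind automated labelling can be sketched in a few lines. This is a simplified illustration, not Ground Truth's actual implementation: the `route_predictions` function, the tuple layout, and the 0.9 threshold are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical confidence cut-off, used for this sketch only.
CONFIDENCE_THRESHOLD = 0.9

# (item id, predicted label, model confidence)
Prediction = Tuple[str, str, float]


def route_predictions(
    predictions: List[Prediction], threshold: float = CONFIDENCE_THRESHOLD
) -> Tuple[Dict[str, str], List[str]]:
    """Split model predictions into auto-labelled items and items for humans."""
    auto_labelled: Dict[str, str] = {}
    needs_human: List[str] = []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labelled[item_id] = label  # accept the model's label
        else:
            needs_human.append(item_id)  # queue for a human labeller
    return auto_labelled, needs_human


if __name__ == "__main__":
    batch = [("img-1", "Dog", 0.97), ("img-2", "Cat", 0.55), ("img-3", "Dog", 0.92)]
    auto, manual = route_predictions(batch)
    print("auto-labelled:", auto)
    print("sent to humans:", manual)
```

In a real job the human labels would then be fed back to retrain the model, so that more items clear the threshold on the next pass.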

Pricing is per labelled object (an image or video frame, an audio recording, a section of text, etc.) and varies depending on whether the object is labelled automatically by Ground Truth or by a human labeller. If you use a vendor or Mechanical Turk to provide labels, you pay an additional cost per labelled object. If you use your own employees for labelling, there is no additional cost per labelled object. The workforce can be public (Mechanical Turk), private (your own employees), or vendor-managed.

Custom Workflow 


You can create your own data labelling workflow in Ground Truth. A custom workflow consists of three components:

  • A UI template that gives workers the instructions and tools needed to perform the labelling task. You can choose from a large selection of templates or upload your own JavaScript/HTML template.
  • A pre-processing AWS Lambda function that serves the unlabelled data and adds any additional context for the labeller.
  • A post-processing AWS Lambda function that applies an accuracy improvement algorithm.

The algorithm assesses the quality of the human annotations and determines which label is “right” and which is “wrong” when the same data item is labelled by multiple human labellers.
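The two Lambda functions could look roughly like the sketch below. The event and response field names (`dataObject`, `source-ref`, `taskInput`, `isHumanAnnotationRequired`) follow my understanding of the Ground Truth Lambda contract and should be checked against the current AWS documentation. The `taskObject` key and the `consolidate_by_majority` helper are names invented for this example, and simple majority voting is only a stand-in for whatever consolidation logic you actually need. A real post-processing Lambda would also read the worker responses from S3, which is omitted here.

```python
from collections import Counter


def pre_annotation_handler(event, context):
    """Pre-processing Lambda: turn one manifest entry into the task shown to a worker."""
    data_object = event["dataObject"]
    # A manifest entry carries either an S3 reference or inline data.
    source = data_object.get("source-ref") or data_object.get("source")
    return {
        # "taskObject" is a key chosen for this sketch; the UI template reads it.
        "taskInput": {"taskObject": source},
        "isHumanAnnotationRequired": "true",
    }


def consolidate_by_majority(worker_labels):
    """Stand-in for the accuracy improvement step: keep the most common label."""
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    return {"label": label, "confidence": votes / len(worker_labels)}


if __name__ == "__main__":
    event = {"dataObject": {"source-ref": "s3://example-bucket/images/dog.jpg"}}
    print(pre_annotation_handler(event, None))
    print(consolidate_by_majority(["Dog", "Dog", "Cat"]))
```

Majority voting treats every worker as equally reliable; weighting workers by their past agreement is a common refinement.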

Amazon SageMaker Ground Truth also lets workers review existing labels to verify that they are correct or to adjust them. These jobs fall into two categories:

  • Label verification – Workers indicate whether the existing labels are correct, or rate the label quality, and if necessary add comments to explain their reasoning.
  • Label adjustment – Workers adjust prior annotations to correct them.

Datasets are stored in Amazon Simple Storage Service (S3) buckets. A bucket holds three things: the unlabelled data, an input manifest file that lists the data files to label, and an output manifest file containing the results of the labelling job.
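As an illustration, an input manifest is a JSON Lines file in which each line points at one data object through a `source-ref` key. The helper below is a minimal sketch that builds such a file as a string. The function name, bucket name, and object keys are placeholders, and uploading the result to S3 is left out.

```python
import json


def build_input_manifest(bucket, keys):
    """Return manifest text: one JSON object per line, each referencing one S3 object."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder bucket and keys, for illustration only.
    manifest = build_input_manifest(
        "example-bucket", ["images/dog1.jpg", "images/dog2.jpg"]
    )
    print(manifest, end="")
```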


For images, Ground Truth supports image classification, object detection, and semantic segmentation. These tasks serve various computer vision use cases, such as models for autonomous vehicles that detect real-world objects like other vehicles, pedestrians, traffic lights, and signals.


For video, Ground Truth supports video clip classification, multi-frame object detection, and multi-frame object tracking, all through the built-in GUI. At 30 frames per second, one minute of video translates to 1,800 individual images.

3D point cloud

3D point clouds are captured using LiDAR to generate a 3D understanding of a physical space at a single point in time. Ground Truth supports 3D point cloud labelling tasks including object detection, object tracking, and semantic segmentation.


For text, Ground Truth supports text classification and entity extraction. Categorizing text into different labels is often used for natural language processing (NLP) models that identify things like topics, product descriptions, movie reviews or sentiment.

Download Annotated Dataset

You can download individual annotation files directly from the S3 bucket. To download the entire annotations folder at once, install the AWS Command Line Interface (AWS CLI):

pip install awscli

And then run

aws s3 sync s3://<source_bucket> <local_destination>

Example annotation output for a bounding box drawn around a dog in an image:

    {
      "boundingBox": {
        "boundingBoxes": [
          {
            "label": "Dog",
            "height": 840,
            "width": 756,
            "top": 20,
            "left": 55
          }
        ],
        "inputImageProperties": {
          "height": 512,
          "width": 926
        }
      }
    }

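Once downloaded, an annotation like the one above can be read with a few lines of Python. The sketch below assumes the annotation has the `boundingBox` / `boundingBoxes` layout shown in this article and converts each box from left/top/width/height to corner coordinates, a form many training pipelines expect. The `to_corner_boxes` name is made up for this example.

```python
import json

# Same structure and values as the example annotation in this article.
ANNOTATION = """
{
  "boundingBox": {
    "boundingBoxes": [
      {"label": "Dog", "height": 840, "width": 756, "top": 20, "left": 55}
    ],
    "inputImageProperties": {"height": 512, "width": 926}
  }
}
"""


def to_corner_boxes(annotation):
    """Convert each box to a (label, x_min, y_min, x_max, y_max) tuple."""
    boxes = []
    for box in annotation["boundingBox"]["boundingBoxes"]:
        x_min, y_min = box["left"], box["top"]
        x_max = x_min + box["width"]
        y_max = y_min + box["height"]
        boxes.append((box["label"], x_min, y_min, x_max, y_max))
    return boxes


if __name__ == "__main__":
    print(to_corner_boxes(json.loads(ANNOTATION)))
```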

Companies that use SageMaker Ground Truth include NFL (sports), Airbnb (hospitality), PrecisionHawk (drone technology), AstraZeneca (pharmaceuticals), T-Mobile (wireless service provider), Pinterest (web and mobile application for discovering information on the web), Change Healthcare (healthcare technology company), GumGum (AI company specializing in computer vision solutions), Automagi (AI/ML bot SaaS provider), and ZipRecruiter (job posting services).

Copyright Analytics India Magazine Pvt Ltd
