How to Automate Data Labelling with Amazon SageMaker Ground Truth

AWS (Amazon Web Services) is one of the most widely used cloud service providers. In 2017, AWS released Amazon SageMaker, its fully managed machine learning platform in the cloud, which allows developers to create, train, and deploy models quickly. In 2018, Amazon SageMaker Ground Truth was launched as a fully managed data labelling service for generating high-quality ground truth datasets for training machine learning models.

Ground Truth can use Amazon Mechanical Turk (a crowdsourcing platform), an internal data labelling team, or external third-party vendors to get the labelling job done. Workflows can be custom-built or chosen from the built-in options. The labelled dataset that Ground Truth produces can be used to train your own models or as a training dataset for an Amazon SageMaker model.


SageMaker Ground Truth offers labelling services for image, audio, video, and text data, with assistive features such as removal of distortion in images, automatic 3D cuboid snapping, and auto-segment tools that reduce labelling time. Automated labelling is also possible through active learning, where a model trained on human-labelled data learns to label the remaining data.

Pricing varies per labelled object (an image or video frame, an audio recording, a section of text, and so on), and you are charged for each object whether it is labelled automatically by Ground Truth or by a human labeller. If you use a vendor or Mechanical Turk to provide labels, you pay an additional cost per labelled object. If you use your own employees for labelling, there is no additional cost per labelled object. The workforce can be public (Mechanical Turk), private (your own employees), or supplied by a vendor.

Custom Workflow 

You can create your own data labelling workflow in Ground Truth. A custom workflow consists of three components:

  • A UI template that provides workers with the instructions and tools needed to perform the labelling task. Ground Truth offers a large selection of templates, and users can also upload their own JavaScript/HTML template.
  • An AWS Lambda function that encapsulates the pre-processing logic: it serves the unlabelled data and adds any additional context for the labeller.
  • An AWS Lambda function that encapsulates the post-processing logic, which can be used to insert an accuracy improvement algorithm.

The accuracy improvement algorithm assesses the quality of annotations made by humans and, when the same data object is sent to multiple human labellers, determines which label is “right” and which is “wrong” by comparing their answers.
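As a rough illustration of the two Lambda roles, here is a minimal Python sketch. The event and return shapes of the pre-processing handler follow the format AWS documents for pre-annotation Lambda functions (`dataObject`, `taskInput`, `isHumanAnnotationRequired`). The `taskObject` key and the `majority_vote` helper are assumptions made for illustration: the helper is a simplified stand-in for an accuracy improvement step, not Ground Truth's own consolidation logic.

```python
from collections import Counter


def pre_annotation_handler(event, context):
    """Pre-processing Lambda: pass one data object on to the worker UI."""
    data_object = event["dataObject"]
    # File-based jobs carry an S3 URI under "source-ref";
    # inline text jobs carry the text itself under "source".
    source = data_object.get("source-ref", data_object.get("source"))
    return {
        # "taskObject" is an assumed key name; it has to match
        # whatever variable the UI template reads.
        "taskInput": {"taskObject": source},
        "isHumanAnnotationRequired": "true",
    }


def majority_vote(labels):
    """Return the most frequent label and the share of workers who chose it.

    Hypothetical, simplified consolidation: every worker counts equally.
    """
    if not labels:
        raise ValueError("at least one label is required")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)
```

In a real post-processing Lambda the worker answers would first have to be read from the payload that Ground Truth supplies; a helper like `majority_vote` would then run on the labels collected for each data object.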

Amazon SageMaker Ground Truth also lets workers verify whether existing labels are correct or need to be adjusted. These jobs fall into two categories:

  • Label verification – Workers indicate whether the existing labels are correct or rate the label quality, and can add comments to explain their reasoning.
  • Label adjustment – Workers adjust prior annotations to correct them.

Datasets are stored in Amazon Simple Storage Service (S3) buckets. The buckets contain three things: the unlabelled data, an input manifest file that lists the data files to be labelled, and an output manifest file containing the results of the labelling job.
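An input manifest is a JSON Lines file in which each line describes one data object. For files stored in S3, the line holds the object's URI under the `source-ref` key. The sketch below builds such a manifest in Python; the bucket name and object keys are made-up placeholders.

```python
import json


def build_manifest(bucket, keys):
    """Return input-manifest text: one JSON object per line, one line per file."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and keys, for illustration only.
manifest = build_manifest(
    "my-labelling-bucket",
    ["images/dog1.jpg", "images/dog2.jpg"],
)
print(manifest, end="")
```

The resulting text would then be saved as a file, uploaded to S3, and referenced when the labelling job is created.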


For images, Ground Truth supports image classification, object detection, and semantic segmentation. These cover various computer vision use cases, such as models for autonomous vehicles that detect real-world objects like other vehicles, pedestrians, traffic lights, and signals.


For video, Ground Truth supports multi-frame object detection, multi-frame object tracking, and video clip classification. At 30 frames per second, one minute of video translates to 1,800 individual images, which workers label using the built-in GUI.

3D point cloud

3D point clouds are captured using LiDAR to generate a 3D understanding of a physical space at a single point in time. Ground Truth supports labelling 3D point cloud data for object detection, object tracking, and semantic segmentation.


Categorizing text into different labels is often used for natural language processing (NLP) models that identify things like topics, product descriptions, movie reviews, or sentiment. For text, Ground Truth supports text classification and entity extraction.

Download Annotated Dataset

To download the annotated dataset, you can download individual files from the S3 bucket. To download the entire annotations folder, install the AWS Command Line Interface (AWS CLI):

pip install awscli

And then run

aws s3 sync s3://<source_bucket> <local_destination>

A bounding-box annotation for an image of a dog looks like this in JSON:

    {
      "boundingBox": {
        "boundingBoxes": [
          {
            "label": "Dog",
            "height": 840,
            "width": 756,
            "top": 20,
            "left": 55
          }
        ],
        "inputImageProperties": {
          "height": 512,
          "width": 926
        }
      }
    }
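Downstream training code often expects boxes as corner coordinates rather than top/left/width/height. A minimal Python sketch of that conversion, assuming a well-formed version of the annotation above; the helper name `to_corners` is hypothetical:

```python
import json

# Well-formed JSON carrying the sample values from the annotation above.
annotation_json = """
{
  "boundingBox": {
    "boundingBoxes": [
      {"label": "Dog", "height": 840, "width": 756, "top": 20, "left": 55}
    ],
    "inputImageProperties": {"height": 512, "width": 926}
  }
}
"""


def to_corners(annotation):
    """Convert each box to (label, x_min, y_min, x_max, y_max) in pixels."""
    boxes = annotation["boundingBox"]["boundingBoxes"]
    return [
        (b["label"], b["left"], b["top"],
         b["left"] + b["width"], b["top"] + b["height"])
        for b in boxes
    ]


corners = to_corners(json.loads(annotation_json))
print(corners)
```

With these sample values the box extends beyond the stated 926 x 512 image, so a real pipeline may want to validate or clip coordinates before training.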


Organizations that use Ground Truth include NFL (sports), Airbnb (hospitality), PrecisionHawk (drone technology), AstraZeneca (pharmaceuticals), T-Mobile (wireless service provider), Pinterest (web and mobile application for discovering information on the web), Change Healthcare (healthcare technology company), GumGum (AI company specializing in computer vision solutions), Automagi (AI/ML bot SaaS service provider), and ZipRecruiter (job posting services).

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
