How to Automate Data Labelling with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labelling service for generating high-quality ground truth datasets for training machine learning models.

Amazon Web Services (AWS) is the most popular and widely used cloud service provider. In 2017, AWS released Amazon SageMaker, its fully managed machine learning platform in the cloud, which allows developers to create, train and deploy their models quickly. In 2018, Amazon SageMaker Ground Truth was launched as a fully managed data labelling service for generating high-quality ground truth datasets for training machine learning models.

Ground Truth can use Amazon Mechanical Turk (the crowdsourcing platform), an internal data labelling team, or external third-party vendors to get the labelling job done. Workflows can be built-in or customized. The labelled dataset that Ground Truth outputs can be used to train your own models or as a training dataset for an Amazon SageMaker model.

SageMaker Ground Truth offers labelling services for image, audio, video, and text data, with features such as removal of distortion in images, automatic 3D cuboid snapping, and auto-segment tools that reduce labelling time. Automated labelling is possible using active learning, a machine learning technique in which a model learns to label the data from the examples humans have already labelled.

Pricing varies per labelled object (an image or video frame, an audio recording, a section of text, etc.) and depends on whether the object is labelled automatically by Ground Truth or by a human labeller. If you use a vendor or Mechanical Turk to provide labels, you pay an additional cost per labelled object. If you use your own employees for labelling, there is no additional cost per labelled object. The workforce can be public (Mechanical Turk), private (your own employees), or provided by a vendor.

Custom Workflow 

You can create your own data labelling workflow in Ground Truth. A custom workflow consists of three components:

  • A UI template that provides workers with the instructions and tools needed to perform the labelling task. A large selection of templates is available, and users can also upload their own JavaScript/HTML template.
  • An AWS Lambda function that encapsulates the pre-processing logic: it serves the unlabelled data and adds any additional context for the labeller.
  • An AWS Lambda function that encapsulates the post-processing logic, typically used to insert an accuracy improvement algorithm.
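As a rough illustration of the pre-processing component, the sketch below shows what such a Lambda handler might look like in Python. It assumes the incoming event carries a `dataObject` with either a `source-ref` (an S3 URI) or an inline `source` field, and that the handler returns a `taskInput` object for the UI template together with an `isHumanAnnotationRequired` flag. The `taskObject` key, the `labels` list, and the bucket name are hypothetical and would have to match your own template and data.

```python
def lambda_handler(event, context):
    """Pre-processing sketch: hand one data object to the labelling UI."""
    data_object = event["dataObject"]
    # An item is either referenced by S3 URI ("source-ref") or inlined ("source").
    source = data_object.get("source-ref") or data_object.get("source")

    return {
        "taskInput": {
            "taskObject": source,      # hypothetical key read by the UI template
            "labels": ["Dog", "Cat"],  # hypothetical extra context for the labeller
        },
        # Assumed flag telling Ground Truth to send this item to a human.
        "isHumanAnnotationRequired": "true",
    }


if __name__ == "__main__":
    # Hypothetical event for local experimentation only.
    sample_event = {"dataObject": {"source-ref": "s3://example-bucket/images/dog.jpg"}}
    print(lambda_handler(sample_event, None))
```

Keeping the handler free of side effects like this makes it easy to try out locally before wiring it into a labelling job.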

The algorithm assesses the quality of the annotations made by humans and decides what is “right” and what is “wrong” when the same data is given to multiple human labellers.
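One simple way to picture this kind of consolidation is a majority vote across workers. The sketch below is only a stand-in under that assumption, not the algorithm Ground Truth itself uses; the function name `consolidate` and the sample labels are hypothetical.

```python
from collections import Counter


def consolidate(labels):
    """Return the most common label and the share of workers who chose it."""
    counts = Counter(labels)
    # most_common(1) yields the label with the highest count; ties go to the
    # label that was seen first.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


if __name__ == "__main__":
    # Three hypothetical workers labelled the same image.
    print(consolidate(["Dog", "Dog", "Cat"]))
```

The returned share can serve as a crude confidence score: a low value suggests the item is ambiguous and may deserve another round of labelling.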

Amazon SageMaker Ground Truth also lets workers verify whether existing labels are correct or need to be adjusted. These jobs fall into two categories:

  • Label verification – Workers indicate whether the existing labels are correct or rate the label quality, and can add comments to explain their reasoning.
  • Label adjustment – Workers adjust prior annotations to correct them.

Datasets are stored in Amazon Simple Storage Service (S3) buckets. A bucket contains three things: the unlabelled data, an input manifest file that lists the data files to be read, and an output manifest file containing the results of the labelling job.
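To make the input manifest more concrete, here is a minimal sketch that builds one in Python. It assumes the JSON Lines layout, where each line is a JSON object whose `source-ref` field points to one data file in S3. The bucket and file names are hypothetical.

```python
import json


def build_manifest(s3_uris):
    """Return manifest text with one {"source-ref": uri} object per line."""
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)


if __name__ == "__main__":
    # Hypothetical S3 locations of the unlabelled images.
    uris = [
        "s3://example-bucket/images/dog-001.jpg",
        "s3://example-bucket/images/dog-002.jpg",
    ]
    print(build_manifest(uris), end="")
```

The resulting text would then be saved as a file and uploaded to the bucket, so that the labelling job can be pointed at it.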


For images, Ground Truth supports image classification, object detection, and semantic segmentation. These cover computer vision use cases such as models for autonomous vehicles that detect real-world objects like other vehicles, pedestrians, traffic lights, and signals.


For video, the tasks are multi-frame object classification, multi-frame object tracking, and video clip classification. At 30 frames per second, one minute of video translates to 1,800 individual images (30 × 60) to label in the built-in GUI.

3D point cloud

3D point clouds are captured using LiDAR to generate a 3D understanding of a physical space at a single point in time. Ground Truth supports object detection, object tracking, and semantic segmentation on 3D point cloud data.


Categorizing text into different labels is often used for natural language processing (NLP) models that identify things like topics, product descriptions, movie reviews or sentiment. Ground Truth covers two text tasks:

  • Text classification
  • Entity extraction

Download Annotated Dataset

To download the annotated dataset, you can download individual files from the S3 bucket. To download the entire annotations folder at once, install the AWS CLI:

pip install awscli

And then run

aws s3 sync s3://<source_bucket> <local_destination>
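If you would rather stay in Python, a similar download can be sketched with boto3. This assumes boto3 is installed and AWS credentials are already configured; the function names `local_path_for` and `download_prefix` are hypothetical. Unlike `aws s3 sync`, this sketch downloads every object again on each run instead of skipping files that are already up to date.

```python
import os


def local_path_for(key, prefix, destination):
    """Map an S3 object key under `prefix` to a path under `destination`."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(destination, *relative.split("/"))


def download_prefix(bucket, prefix, destination):
    """Download every object under s3://bucket/prefix into `destination`."""
    import boto3  # imported here so the path helper works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            target = local_path_for(key, prefix, destination)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            s3.download_file(bucket, key, target)
```

A call such as `download_prefix("<source_bucket>", "<prefix>/", "<local_destination>")` would mirror the folder structure under the prefix into the local directory.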

Example annotation output for a bounding box drawn around a dog in an image:

    {
      "boundingBox": {
        "boundingBoxes": [
          {
            "label": "Dog",
            "height": 840,
            "width": 756,
            "top": 20,
            "left": 55
          }
        ],
        "inputImageProperties": {
          "height": 512,
          "width": 926
        }
      }
    }
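Many training pipelines expect boxes as corner coordinates rather than top/left/width/height. The sketch below parses an annotation with the same fields and values as the example above and converts each box to `(x_min, y_min, x_max, y_max)`. It assumes the values are in pixels with the origin at the top-left corner of the image; the function names are hypothetical.

```python
import json

# Same fields and values as the article's example annotation.
ANNOTATION = """
{"boundingBox": {
    "boundingBoxes": [
        {"label": "Dog", "height": 840, "width": 756, "top": 20, "left": 55}
    ],
    "inputImageProperties": {"height": 512, "width": 926}
}}
"""


def to_corners(box):
    """Convert a top/left/width/height box to (x_min, y_min, x_max, y_max)."""
    x_min = box["left"]
    y_min = box["top"]
    return (x_min, y_min, x_min + box["width"], y_min + box["height"])


def labelled_corners(annotation_json):
    """Return a list of (label, corner box) pairs from one annotation."""
    data = json.loads(annotation_json)
    boxes = data["boundingBox"]["boundingBoxes"]
    return [(box["label"], to_corners(box)) for box in boxes]


if __name__ == "__main__":
    print(labelled_corners(ANNOTATION))
```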


Customers using Ground Truth include NFL (sports), Airbnb (hospitality), PrecisionHawk (drone technology), AstraZeneca (pharmaceuticals), T-Mobile (wireless service provider), Pinterest (web and mobile application for information on the web), Change Healthcare (healthcare technology company), GumGum (AI company specializing in computer vision solutions), Automagi (AI/ML bot SaaS provider), and ZipRecruiter (job posting services).

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
