In recent years, deep learning models have driven substantial advances across many areas, including computer vision. Computer vision typically works by analysing images captured with a physical camera sensor, followed by a human-in-the-loop process in which annotators label the things of interest.
For example, to spot a small object within an image, a simple bounding box around it might suffice. But once you want a robot to grasp something, you may need a segmentation mask that traces the fine contours of the object. Once this data is collected and labelled, an algorithm can be trained on it and then deployed to an edge device such as a smart camera, to be sold to consumers or businesses.
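To make the difference between the two annotation types concrete, here is a minimal sketch (using a made-up 6×6 mask, not data from any real dataset) showing that a segmentation mask carries per-pixel detail, while a bounding box is the coarser summary you can always derive from it:

```python
# A hypothetical 6x6 binary segmentation mask: 1 = object pixel, 0 = background.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

def mask_to_bbox(mask):
    """Collapse a per-pixel mask into the coarser bounding-box annotation."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    # (x_min, y_min, x_max, y_max) in pixel coordinates
    return min(cols), min(rows), max(cols), max(rows)

print(mask_to_bbox(mask))  # (1, 1, 4, 3)
```

Going the other way is impossible: the box tells you nothing about the object's contours, which is why grasping and similar tasks need the more expensive mask annotations.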
For practitioners in modern computer vision, the greatest bottleneck throughout this process has often been data. The first two steps, collecting and annotating data, usually take several months. Another reason data is the real bottleneck is that algorithms are now a dime a dozen, with hundreds of new ones appearing regularly. According to experts, synthetic data can help computer vision and ML engineers who want a quicker, more effective way of sourcing and annotating images to train their AI models.
Synthetic Data Overcomes The Challenges That Come With Manually Annotated Data
Supervised training of deep models needs large amounts of annotated data, and producing it can be an exhausting and costly task. An alternative is to use synthetic data, often photorealistic, to train vision AI models.
A synthetic dataset is a repository of data that is generated programmatically rather than collected through any real-life survey or experiment. Its primary purpose, therefore, is to be flexible and rich enough to let an ML practitioner experiment with various classification, regression, and clustering algorithms. It also solves the challenge of dataset unavailability.
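The "generated programmatically" point can be shown in a few lines. This is a toy sketch (the function name and parameters are our own, not any particular library's API): two Gaussian clusters are sampled, and because we choose which cluster each point comes from, the labels exist by construction, with no survey or annotator involved.

```python
import random

random.seed(0)

def make_blobs(n_per_class=100, centers=((0.0, 0.0), (5.0, 5.0)), spread=1.0):
    """Generate a toy 2-D classification dataset: points drawn around each
    centre, labelled by which centre produced them. The label is known by
    construction, so no manual annotation is needed."""
    X, y = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            X.append((random.gauss(cx, spread), random.gauss(cy, spread)))
            y.append(label)
    return X, y

X, y = make_blobs()
print(len(X), len(y), sorted(set(y)))  # 200 200 [0, 1]
```

The same idea scales up: change `centers`, `spread`, or the sampling distribution and you have a fresh dataset for a new experiment in seconds.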
Collecting visual data is labour intensive and is a bottleneck for many companies. Even when you have the data, annotations can be mislabelled, and those errors can seriously derail your algorithm. Companies have often had to spend many hours fixing the annotations they received from outsourced labellers.
Another major problem, and perhaps the most critical part of the equation, is overall labour cost: the employees collecting the data, the annotators labelling it, and the computer vision professionals reviewing it again for errors. Synthetic data can address these challenges efficiently.
Synthetic Data Generation Via Simulation
Startups are focusing on generating high-quality synthetic data to train the computer vision algorithms of the future in a simulated environment, which involves producing entire frames with sophisticated environmental modelling, sensor noise modelling, and rendering. Synthetic data engines can create perfectly annotated images of whatever objects or environments are needed. Such services draw on an immense library of 3D CAD files, which are procedurally placed into massively parallel, algorithmically generated simulation environments.
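A heavily simplified sketch of that pipeline, under our own assumptions (a flat rectangle stands in for a rendered CAD model, and Gaussian noise stands in for a real sensor noise model): the renderer produces the image and its pixel-perfect annotation in the same pass.

```python
import random

random.seed(42)

def render_frame(h=32, w=32, obj=(8, 8, 20, 20), noise_sigma=5.0):
    """Render a toy frame: a bright rectangular 'object' on a dark background,
    with additive Gaussian sensor noise. The ground-truth mask is produced in
    the same loop, so every pixel is perfectly annotated by construction."""
    x0, y0, x1, y1 = obj
    image, mask = [], []
    for r in range(h):
        img_row, msk_row = [], []
        for c in range(w):
            inside = x0 <= c < x1 and y0 <= r < y1
            base = 200.0 if inside else 30.0                       # object vs background
            img_row.append(base + random.gauss(0.0, noise_sigma))  # sensor noise model
            msk_row.append(1 if inside else 0)
        image.append(img_row)
        mask.append(msk_row)
    return image, mask

image, mask = render_frame()
print(sum(sum(row) for row in mask))  # 144 object pixels (a 12 x 12 rectangle)
```

Real engines replace each piece with something far richer (ray-traced rendering, physically based materials, lens and rolling-shutter effects), but the key property is the same: the annotation is a free by-product of the render, never a separate labelling step.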
In one study, researchers used domain randomisation for car detection, dropping photorealism entirely when creating the synthetic dataset. The main idea is to force the network to learn only the fundamental features of the task. As of today, synthetic data won't completely replace real data, but we're getting close: one study suggests that by using only 10% of the real labelled data, augmented with synthetic data, you can match the performance of training on all of the labelled data.
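Domain randomisation can be sketched in the same toy setting (again our own illustrative code, not the study's actual pipeline): every sampled frame gets a random background shade, object shade, size, and position, so the only invariant left for a network to latch onto is the object's shape, the "fundamental feature" of the task.

```python
import random

random.seed(7)

def randomized_sample(h=32, w=32):
    """Domain randomisation sketch: each training frame randomises background
    shade, object shade, object size, and position. Only the rectangular shape
    is constant across samples, so that is all a detector can learn to rely on."""
    bg = random.uniform(0.0, 255.0)          # random background intensity
    fg = random.uniform(0.0, 255.0)          # random object intensity
    side = random.randint(4, 12)             # random object size
    x0 = random.randint(0, w - side)         # random object position
    y0 = random.randint(0, h - side)
    image = [[fg if x0 <= c < x0 + side and y0 <= r < y0 + side else bg
              for c in range(w)] for r in range(h)]
    bbox = (x0, y0, x0 + side, y0 + side)    # perfect label, free of charge
    return image, bbox

image, bbox = randomized_sample()
```

In the real studies the randomised factors are textures, lighting, camera pose, and distractor objects rather than flat shades, but the principle is identical: deliberately unrealistic variation that pushes the model towards features that also hold in the real world.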
In conclusion, researchers have recognised the impressive results of synthetic training, deeming it valuable for computer vision since real data can be expensive to annotate. Using simulated data as a cheaper source of training samples can deliver significant savings in both cost and time, and can also address the scarcity of real-world data.