How Synthetic Data Might Impact AI In 2022

Synthetic data is artificially generated data that reflects real-world data, either mathematically or statistically


Published on December 18, 2021

by Victor Dey

Data is the new oil in today’s rapidly developing age of AI, but collecting and processing accurate data can be expensive and time-consuming. Many organisations are therefore making their own fuel, one that is both inexpensive and effective: synthetic data. This data is created in a digital environment rather than collected from or measured in the real world. Synthetic data is artificially generated data that reflects real-world data, either mathematically or statistically. Research suggests it can be as good as, or even better than, data based on actual objects, events or people for training AI models.
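To make the idea of data that "reflects real-world data statistically" concrete, here is a minimal sketch in Python. It assumes the real data is a single numeric column that is roughly normally distributed; the measurements, function names and sample count are all hypothetical and chosen only for illustration. Production synthetic data tools rely on far richer models, such as simulations or generative networks.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(real_values, n_samples, seed=0):
    """Draw n_samples new values from a Gaussian fitted to real_values.

    Assumption: a normal distribution is a reasonable model for the data.
    """
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


# Hypothetical "real" measurements, for illustration only.
real = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
synthetic = generate_synthetic(real, n_samples=10_000)

print(f"real mean/stdev:      {statistics.mean(real):.3f} / {statistics.stdev(real):.3f}")
print(f"synthetic mean/stdev: {statistics.mean(synthetic):.3f} / {statistics.stdev(synthetic):.3f}")
```

The synthetic values are not copies of any real measurement, yet their mean and spread closely track those of the original sample, which is the sense in which synthetic data mirrors the real thing.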

Synthetic Data: Not A New Trend

Forbes included synthetic data in its list of the five biggest data science trends for 2022, while Gartner listed it among its top strategic predictions for 2022 and beyond. Because saving time is a key factor in today’s AI development, it appears to be the driving force behind the growing interest in synthetic data. It can take as long as 20 laborious weeks to gather and annotate 100,000 real-world images, which is a typical requirement for training a visual AI system to see and understand the world as a human brain does. By this account, data collection and annotation take up about 80 per cent of a machine learning project’s time, just to do something as basic as training a system. In the coming years, we might see a major change in how the world views data and how we train AI.

Image Source: Gartner

A June 2021 Gartner report on synthetic data predicts that by 2030, much of the data used in AI will be artificially generated by rules, statistical models, simulations or other techniques. “The fact is you won’t be able to build high-quality, high-value AI models without synthetic data,” the report said.

The rise of synthetic data also comes as AI pioneers such as Andrew Ng call for a broad shift to a data-centric approach to machine learning. He is also rallying support for a benchmark or competition on data quality, which is often claimed to represent 80 per cent of the work in AI. “Most benchmarks provide a fixed set of data and invite researchers to iterate on the code; perhaps it’s time to hold the code fixed and invite researchers to improve the data,” he wrote in his newsletter, The Batch.

Can Synthetic Data Bring Out The Best In AI?

Today’s fast-paced environments need democratized access to training data: data that meets privacy regulations and can be annotated faster. Synthetic data can help meet these demands. For an AI system, there is no ‘real’ or ‘synthetic’; there is only the data we feed it to learn from. Synthetic data creation platforms for AI training can generate the thousands of high-quality images needed in a couple of days instead of months. And because the data is computer-generated rather than captured from real people, privacy concerns are greatly reduced. At the same time, biases that exist in real-world visual data can be more easily identified and corrected. Furthermore, these computer-generated datasets come automatically labelled and can deliberately include rare but crucial corner cases, which real-world data often fails to capture.

Findings suggest that using synthetic data can enhance a machine learning model’s accuracy. McKinsey revealed last year that 49 per cent of the highest-performing AI companies were already using synthetic data to train their AI models. Recent research also suggests that strong training results can be achieved with a hybrid dataset comprising 90 per cent synthetic and 10 per cent real-world data. Gartner also recently predicted that 60 per cent of the data used for AI and data analytics projects will be synthetic by 2024 and that, by 2030, synthetic data will have completely overtaken real data in AI models.
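As a rough illustration of what a 90/10 hybrid dataset looks like in practice, the sketch below mixes samples from a synthetic pool and a real pool at a chosen ratio. The function name, the pool contents and the dataset size are hypothetical; the research mentioned above does not prescribe this particular procedure.

```python
import random


def build_hybrid_dataset(synthetic, real, total_size, synthetic_fraction=0.9, seed=0):
    """Return a shuffled dataset with the requested share of synthetic samples."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_synthetic = round(total_size * synthetic_fraction)
    n_real = total_size - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples to satisfy the requested mix")
    # Sample without replacement from each pool, then shuffle so that
    # synthetic and real examples are interleaved during training.
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


# Hypothetical pools: each item is (source, sample_id). In a real project
# these would be images or records together with their labels.
synthetic_pool = [("synthetic", i) for i in range(5000)]
real_pool = [("real", i) for i in range(500)]

hybrid = build_hybrid_dataset(synthetic_pool, real_pool, total_size=1000)
n_synthetic = sum(1 for source, _ in hybrid if source == "synthetic")
print(f"{n_synthetic} synthetic, {len(hybrid) - n_synthetic} real")
```

Keeping the ratio as a parameter makes it easy to experiment with other mixes, since the best balance is likely to depend on the task and on the quality of the synthetic data.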

Summing up

Considering the benefits of synthetic data and the pace at which it is being generated and adopted, there is no denying that it has the potential to revolutionize machine learning model training. Using synthetic data, one can further explore the true potential of AI and overcome the barriers that are holding the field back. In addition, the near-limitless flexibility to generate data for any use case, to any required specification, makes it a truly revelatory leap forward.


Victor Dey

Victor is an aspiring Data Scientist and holds a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and a former University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.