Last updated February 17, 2023

Machines Are Dreaming Instead of Learning

Synthetic data is on the rise and is likely to become an important tool for AI modelling. But is there a flip side?


Published on February 17, 2023

by Mohit Pandey

Nearly 60% of all data used in AI will be synthetic by 2024, according to a Gartner study. More and more startups focus solely on synthetic data. These include Mostly.AI, which generates data from existing data; Hazy, known for generating financial data; and Synthetiac, which offers image processing data, along with many more such as Datomize, SynthesisAI, and SkyengineAI.

Synthetic data is on the rise and is likely to become an important tool for AI modelling. It is particularly beneficial in finance, image classification, and computer-vision-based fields such as autonomous vehicles. But is there a flip side?

Data scarcity, according to AI/ML developers, is one of the biggest obstacles to developing further AI models. Data is indubitably one of the most important components for building generative AI models and, interestingly, generative AI might itself be the answer to fuelling AI models with more data.

The question is: how much of the ‘data problem’ is about the quantity of data versus its quality? To deal with data scarcity, people are moving away from accessing and using real data towards using synthetic data. In a nutshell, synthetic data is data generated artificially, either mathematically or statistically, so that it appears close to real-world data. Proponents argue that the larger volume of data can improve model accuracy and help compensate for flaws in the existing data. Synthetic data has other attractions too, such as data privacy: because a synthetic record is not tied to any individual in the real world, privacy concerns are greatly reduced.

Too Artificial

In a Reddit thread, developers discuss the pros and cons of synthetic data. One user points out, “We are standing on the brink of a world where many of the technologies that surround us might not be built in response to reality, but to what a machine imagines reality to be.”

While synthetic data might seem like a panacea for data privacy and security concerns, it comes with its own set of challenges. Firstly, the quality of synthetic data depends on the quality of the real data it is derived from. There is a high possibility that data generated from biased or incomplete data will lead to models that perform even worse. The fact that it is “synthetic” means that it can be highly unreliable. Moreover, real-world data contains outliers, which might be useful for some models.
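The dependence on source data can be illustrated with a minimal sketch. It assumes a deliberately naive, hypothetical generator that only learns category frequencies from a real sample and resamples from them; the group labels, the 95/5 split, and the function names are invented for illustration and do not describe any vendor's method.

```python
import random
from collections import Counter

def learn_frequencies(labels):
    """Estimate how often each category appears in the real sample."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def generate_synthetic(frequencies, n, rng):
    """Draw n synthetic labels using the learned frequencies as weights."""
    categories = list(frequencies)
    weights = [frequencies[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)

rng = random.Random(0)

# Hypothetical biased "real" sample: group B is heavily under-represented.
real_labels = ["A"] * 9500 + ["B"] * 500

frequencies = learn_frequencies(real_labels)
synthetic_labels = generate_synthetic(frequencies, 10_000, rng)

real_share_b = real_labels.count("B") / len(real_labels)
synthetic_share_b = synthetic_labels.count("B") / len(synthetic_labels)
print(f"share of B in real data:      {real_share_b:.3f}")
print(f"share of B in synthetic data: {synthetic_share_b:.3f}")
```

Under these assumptions the synthetic sample reproduces roughly the same imbalance as the source. Generating more rows does not correct it, because the generator has no information beyond the biased sample.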

One reason synthetic data is on the rise is that it can help tackle the bias present in smaller datasets. Larger real datasets can contain poor-quality data, which requires more fine-tuning and heavier workloads, but synthetic data still does not capture the quality and variability present in real-world data.

Synthetic data is generated using algorithms that model the statistical properties of real data. While it may be able to emulate the distribution and characteristics of the original data, it can never capture the richness and complexity of the real-world phenomena it represents. Therefore, ML models trained on synthetic data may not be as accurate or effective as those trained on real data.
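As a rough illustration of this point, the sketch below fits a single Gaussian to a hypothetical "real" sample that contains a small cluster of rare extreme values, then samples synthetic values from the fitted model. The data, the threshold of 100, and the choice of a Gaussian are all assumptions made for the example; real generators are usually far more sophisticated.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of the real sample."""
    return statistics.fmean(values), statistics.stdev(values)

def sample_synthetic(mean, stdev, n, rng):
    """Draw n synthetic values from the fitted Gaussian."""
    return [rng.gauss(mean, stdev) for _ in range(n)]

def tail_fraction(values, threshold):
    """Fraction of values that exceed the threshold."""
    return sum(v > threshold for v in values) / len(values)

rng = random.Random(0)

# Hypothetical "real" data: mostly ordinary values plus 1% rare extremes.
real = [rng.gauss(50, 5) for _ in range(9900)]
real += [rng.gauss(120, 5) for _ in range(100)]

mean, stdev = fit_gaussian(real)
synthetic = sample_synthetic(mean, stdev, len(real), rng)

print(f"real mean:      {statistics.fmean(real):.2f}")
print(f"synthetic mean: {statistics.fmean(synthetic):.2f}")
print(f"real values above 100:      {tail_fraction(real, 100):.4f}")
print(f"synthetic values above 100: {tail_fraction(synthetic, 100):.4f}")
```

The synthetic sample matches the summary statistics it was fitted to, yet the rare extreme cluster is essentially absent from it. A model trained only on the synthetic values would rarely, if ever, see the cases that make the real data distinctive.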

Generating accurate synthetic data is particularly challenging because the process requires significant expertise and resources to ensure that the data is realistic and meaningful. Even small errors in the generation process can lead to significant inaccuracies. Moreover, the data can be misleading because it is built from a fixed set of parameters, which can result in a lack of variability and diversity.

Not Black and White

We are not making a case for using real-world data all the time, since it falls short in several ways. In healthcare, for example, a lot of data is simply unusable because of privacy concerns. But then, how reliable can artificially generated data be in a field as sensitive as healthcare?

Synthetic data raises many ethical challenges as well. It is true that every dataset suffers from biases; data without bias is an illusion. Incorporating more parameters can actually make the “fairness” of the data more questionable. Instead of removing biases, it can create new ones and amplify existing ones.

Removing biases from a dataset might sound like an ideal approach, but it comes with several downsides. Real-world data is highly dynamic, nuanced, and complex. Some ML engineers believe that models fed with synthetic data might become a closed system. They would be just a “snapshot in time” and would not evolve, creating a “reality gap” in the models and thereby making AI more artificial than it perhaps needs to be.
