Ajinkya Bhave, Country Head (India) at Siemens Engineering Services, recently spoke about Siemens’ use of synthetic data to deploy machine learning models in real-world scenarios. “The idea was that we created synthetic training data, which was then used to train a neural network on a digital twin of the model. Then we tested that on the real faults which occur in the ball bearings of the gearboxes with the physical data. The graph showed us the prediction was pretty accurate,” he said, discussing a gearbox problem in wind turbines. He is not the only one asserting the benefits of synthetic data for training AI models.
MIT Technology Review recently named synthetic data for AI one of its ten breakthrough technologies of 2022, and Forrester’s research has identified synthetic data as part of AI 2.0. The world is growing more data-hungry by the day, yet in a revolution driven by AI models and hemmed in by privacy concerns, data is perpetually scarce. This scarcity is compounded by the need for accurate, representative data to train a fair model. Synthetic data can become a saving grace here. For instance, in an example quoted by MIT, researchers at Data Science Nigeria created synthetic data of African clothing to balance the abundance of datasets of Western clothing. The African dataset and images were created from scratch using AI.
What is synthetic data?
Synthetic data is artificially generated data that mirrors real-world data mathematically or statistically. Research has shown it can serve as an alternative to real data for model training. Several algorithms and tools generate synthetic data to simulate reality. Used properly, synthetic data can be a good complement to human-annotated data while keeping a project’s speed and costs in check.
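At its simplest, “statistically mirroring real data” means fitting a model of the real data’s distribution and sampling new records from it. The sketch below is a hypothetical, minimal illustration of that idea (not any vendor’s actual pipeline): it fits a multivariate Gaussian to a small stand-in dataset and draws an arbitrarily large synthetic sample from it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a small "real" dataset: two correlated features
# (say, sensor temperature and vibration) -- purely hypothetical.
real = rng.multivariate_normal(mean=[60.0, 0.5],
                               cov=[[25.0, 1.2], [1.2, 0.04]],
                               size=500)

# Fit a simple statistical model of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and sample as many synthetic rows as training requires.
synthetic = rng.multivariate_normal(mean, cov, size=10_000)

# The synthetic sample preserves the real data's statistics.
print("real mean:     ", np.round(mean, 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
```

None of the synthetic rows corresponds to a real record, yet the sample preserves the means and correlations a model actually learns from.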
How is synthetic data reframing AI training?
This fake, artificially created data can train AI in areas where real data is scarce or too sensitive to use. For instance, Uber uses synthetic data to validate anomaly detection algorithms and predictions on scarce data. Synthetic data has been used in driverless cars for years, and Forbes and Gartner have both made predictions about its importance, but what exactly makes it essential for AI?
Deepfakes, biased AI and privacy issues have become a huge crisis for AI models; simply put, models trained on inadequate data generate incorrect and untrustworthy predictions. On the other hand, the development of GANs and their ability to generate realistic yet fake samples has made creating synthetic data easier.
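To make the GAN idea concrete, here is a hypothetical toy sketch in PyTorch (far removed from a production image GAN): a generator and a discriminator are trained against each other until the generator’s fake one-dimensional samples become hard to tell apart from the “real” ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" distribution the GAN should learn: N(4, 1.25)
def real_batch(n):
    return 4.0 + 1.25 * torch.randn(n, 1)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # sample -> real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(3000):
    # Train the discriminator to tell real from generated samples.
    real, fake = real_batch(64), G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Generated samples should now roughly approximate the real distribution.
samples = G(torch.randn(1000, 8))
print(f"synthetic mean~{samples.mean().item():.2f}, std~{samples.std().item():.2f} (target: 4.00, 1.25)")
```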
Last November, NVIDIA’s Jensen Huang launched the Omniverse Replicator, “an engine for generating synthetic data with ground truth for training AI networks”. In a conversation with IEEE, Rev Lebaredian, VP of simulation technology and Omniverse engineering at NVIDIA, said synthetic data can make AI systems better and even more ethical.
Real data falls short in several ways. For starters, the information available is not all-inclusive. Secondly, chunks of data are unusable owing to security and privacy concerns. With laws such as the GDPR in the EU and several bills in the US protecting citizens’ data, engineering teams have limited data on which to train AI models. Synthetic data solves these problems and more. Since it is engineered, it can be produced in whatever quantity training requires, arriving already labelled and cleaned of biases. Furthermore, this data is completely anonymous, sidestepping the problem that anonymised personal data can be re-identified or hacked.
Synthetic data also makes for faster and better training of AI models, given the power it hands teams to generate datasets quickly. In addition, because the data is manufactured, it arrives having already passed through the stages of cleaning and maintenance, further saving time and costs. For example, Paul Walborsky, co-founder of one of the first dedicated synthetic data services, told NVIDIA that “a single image that could cost $6 from a labelling service can be artificially generated for six cents”.
Lebaredian illustrated the claim with NVIDIA’s own experience. First, to train a model to recognise dominoes, the team would have had to buy hundreds of domino sets, arrange them across different environments, conditions, sensors and lighting, and then label all the data. Instead, they generated randomised synthetic domino scenes, which proved far more efficient. Second, Lebaredian pointed to the impossibility of collecting real data with the accuracy and diversity needed to train self-driving cars. “There’s really no way around it. Without physically accurate simulation to generate the data we need for these AIs, there’s no way we’re going to progress,” he said.
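The domino anecdote is essentially domain randomisation: vary every nuisance factor at generation time and collect ground-truth labels for free, because the generator placed the objects itself. The sketch below is a hypothetical, stripped-down illustration of that idea; a real pipeline, such as Omniverse Replicator, would render each configuration into an image.

```python
import random

random.seed(7)

# Hypothetical domain-randomisation sketch: each synthetic scene varies
# the factors a photographer would otherwise have to stage by hand.
LIGHTING = ["daylight", "tungsten", "overcast", "backlit"]
SURFACES = ["wood", "felt", "marble", "carpet"]

def random_scene():
    # Random domino tiles, each a (pips, pips) pair placed in the scene.
    tiles = [(random.randint(0, 6), random.randint(0, 6))
             for _ in range(random.randint(1, 28))]
    return {
        "lighting":   random.choice(LIGHTING),
        "surface":    random.choice(SURFACES),
        "camera_deg": round(random.uniform(15, 90), 1),
        "tiles":      tiles,
        "labels":     tiles,  # ground truth comes for free: we placed the tiles
    }

# Generating 10,000 varied, pre-labelled scene configurations is instant;
# staging and hand-labelling the physical equivalent would take weeks.
dataset = [random_scene() for _ in range(10_000)]
print(dataset[0])
```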
Removing the rose-tinted glasses
For all the attention it is receiving, synthetic data is not an all-encompassing answer to the ethical and quantitative debates around datasets. Synthetic data is only as unbiased as the real-life dataset it is based on. It also brings in the issue of the uncanny valley: currently, the gap between real and synthetic data limits the real-world performance of machine learning models trained only in simulation.