Closing the gap between real and synthetic data

Despite the massive opportunities that synthetic data brings, one of its main challenges is the reality gap.
Synthetic Data

Synthetic data was listed among the top five data science trends for 2022, and Gartner named it among its top strategic predictions for the year. In a data-driven world, privacy and process issues often limit the data that researchers can obtain. Artificially generated data, or synthetic data, is a promising way out.

Various algorithms and tools are used to generate synthetic data, which is then used in a wide variety of applications. When used properly, synthetic data can be a good complement to human-annotated data while keeping a project fast and cost-effective.

Despite these opportunities, one of the main challenges synthetic data faces is the reality gap: a neural network can tell the difference between simulation and reality. This domain gap, which is also referred to as the uncanny valley, limits the real-world performance of machine learning models trained only in simulation. Closing the gap is an important research and practical challenge for the effective use of synthetic data.


Domain randomisation

Real-world data often contains a large amount of variability. To match this variability in synthetic data generation, researchers increasingly depend on domain randomisation. In computer vision applications in particular, domain randomisation can randomise parameters such as lighting, pose and object textures.

Domain randomisation has been viewed as an alternative to high-fidelity synthetic images. The technique was introduced by Josh Tobin and his team in a paper titled “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World”. In this paper, the researchers described domain randomisation as a promising method for addressing the reality gap: the simulator is randomised so that, at training time, the model is exposed to a range of environments instead of just one. The team worked on the hypothesis that if the variability in simulation is large enough, models trained in simulation will generalise to the real world without additional training.
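To make the idea concrete, here is a minimal sketch of how a randomised simulator might draw a fresh set of scene parameters for every training image. It is an illustration only, not the setup used in the paper: the parameter names, ranges and texture labels are all hypothetical, and a real pipeline would pass these values to a renderer.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical texture labels; a real simulator would load actual textures.
TEXTURES = ["checker", "noise", "stripes", "flat"]


@dataclass
class SceneParams:
    """One randomised configuration of the (hypothetical) simulator."""
    light_intensity: float  # arbitrary units, assumed range 0.2 to 2.0
    camera_yaw_deg: float   # camera pose, assumed range -180 to 180 degrees
    object_texture: str     # one of TEXTURES
    num_distractors: int    # irrelevant objects added to the scene


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw every parameter independently from a wide, uniform range."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 2.0),
        camera_yaw_deg=rng.uniform(-180.0, 180.0),
        object_texture=rng.choice(TEXTURES),
        num_distractors=rng.randint(0, 10),
    )


def sample_training_scenes(n: int, seed: int = 0) -> List[SceneParams]:
    """Return n independently randomised scenes, reproducible via the seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]
```

Calling `sample_training_scenes(1000)` would give a thousand different environments, so a model trained on the corresponding renders never sees the same lighting, pose and texture combination twice.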

In 2018, researchers from NVIDIA presented a domain randomisation approach to train a neural network for complex tasks like object detection. In this technique, synthetic images are randomly perturbed during training so that the network learns to focus on the relevant features. The results were found to be comparable with those from more expensive and labour-intensive datasets. The researchers demonstrated that domain randomisation outperforms more photorealistic datasets and improves on the results obtained using real data alone.

A refinement of domain randomisation is structured domain randomisation, which takes into account the structure and context of a scene. Unlike domain randomisation, which places objects and distractors randomly according to a uniform probability distribution, structured domain randomisation places them according to probability distributions that reflect the specific problem at hand. This approach helps neural networks take context into consideration during detection tasks.
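The difference between the two placement strategies can be sketched in a few lines. The example below assumes a toy road scene with three lanes; the scene size, lane positions and lane width are hypothetical values chosen for illustration, not taken from any paper.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical scene layout: a 100 x 100 area with three horizontal lanes.
WIDTH, HEIGHT = 100.0, 100.0
LANE_CENTRES_Y = [40.0, 50.0, 60.0]
LANE_HALF_WIDTH = 2.0


def place_uniform(n: int, rng: random.Random) -> List[Point]:
    """Plain domain randomisation: objects may land anywhere in the scene."""
    return [(rng.uniform(0.0, WIDTH), rng.uniform(0.0, HEIGHT)) for _ in range(n)]


def place_structured(n: int, rng: random.Random) -> List[Point]:
    """Structured variant: first sample the context (a lane), then the object."""
    points = []
    for _ in range(n):
        lane_y = rng.choice(LANE_CENTRES_Y)          # scene context
        x = rng.uniform(0.0, WIDTH)                  # position along the lane
        y = lane_y + rng.uniform(-LANE_HALF_WIDTH, LANE_HALF_WIDTH)
        points.append((x, y))
    return points
```

With `place_uniform`, a car could appear in the sky or on a building; with `place_structured`, every car sits on a lane, so the network also sees the context in which cars normally occur.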

Domain adaptation

Despite its popularity, domain randomisation requires a domain expert to define the parts that must stay invariant. Increasing photorealism, on the other hand, requires an artist to model the specific domains in detail, which increases the cost of generating data. Either way, the exercise undermines cost-effectiveness, a major selling point of synthetic data.

Enter domain adaptation.

Domain adaptation is an approach in which a model trained on one domain of data is made to work well on a different target domain. One of the most popular domain adaptation techniques is the use of GANs. Conditional GANs, in particular, take additional inputs to condition the generated output, and image-conditional GANs form a general-purpose framework for image-to-image translation problems. The conditional GAN was proposed in late 2014. In this technique, the GAN architecture is modified by adding a label y to the input of the generator, so that it generates data points corresponding to that label. The label is also added to the discriminator input to help it distinguish real data better.
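The conditioning step described above can be sketched without any deep learning library. The snippet assumes the label y is one-hot encoded and simply concatenated to the inputs, which is one common way to do it; the function names are hypothetical, and the generator and discriminator networks themselves are left out.

```python
from typing import List


def one_hot(label: int, num_classes: int) -> List[float]:
    """Encode the label y as a one-hot vector."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def generator_input(z: List[float], label: int, num_classes: int) -> List[float]:
    """Noise vector z with the label y appended: what the generator receives."""
    return z + one_hot(label, num_classes)


def discriminator_input(x: List[float], label: int, num_classes: int) -> List[float]:
    """Data point x (real or generated) with the same label y appended."""
    return x + one_hot(label, num_classes)
```

For example, `generator_input([0.1, -0.3, 0.7], label=2, num_classes=4)` yields a seven-element vector: three noise values followed by the one-hot code for class 2. Because the discriminator sees the label as well, it can penalise samples that look realistic but do not match the requested class.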

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
