“Data is the new oil” is a maxim we have heard far too often. Data indeed has massive potential in the information economy and can bring tremendous value if made available in an accurate and timely manner. However, many barriers arise when using data for real-world applications: sometimes access is restricted; other times, there is not enough data to get good results; and often, the quality of the data is simply not up to the mark.
Hence, to build data-heavy applications such as machine learning systems, generating synthetic data has become an essential skill for a data scientist.
Generating Synthetic Data
Synthetic data is data generated artificially rather than collected from real-world events. Today, specific algorithms can generate realistic synthetic data for use as training datasets.
Deep generative models learn the distribution of the training data in order to generate new data points with some variation. While it is not always possible to learn the exact distribution of the data, these algorithms can come close.
To achieve this, one of the most commonly used approaches is the Generative Adversarial Network (GAN). A GAN pits two models against each other: the generative model (generator) captures the data distribution, while the discriminative model (discriminator) estimates the probability that a sample is real rather than generated.
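The adversarial objective above can be made concrete with a toy sketch. This is not a full GAN training loop, just the two competing losses computed once with NumPy; the linear generator, logistic-regression discriminator, and all shapes are illustrative assumptions, not part of any real library.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    """Toy discriminator: logistic regression giving P(sample is real)."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def generator(z, w):
    """Toy generator: a linear map from noise z to a synthetic sample."""
    return z @ w

# Hypothetical shapes: 2-D noise vectors mapped to 3-D samples.
w_g = rng.normal(size=(2, 3))   # generator weights
w_d = rng.normal(size=(3,))     # discriminator weights

real = rng.normal(loc=1.0, size=(64, 3))         # "real" training batch
fake = generator(rng.normal(size=(64, 2)), w_g)  # synthetic batch

# Discriminator loss: push D(real) toward 1 and D(fake) toward 0.
d_loss = -np.mean(np.log(discriminator(real, w_d) + 1e-8)
                  + np.log(1.0 - discriminator(fake, w_d) + 1e-8))

# Generator loss: fool the discriminator into scoring fakes as real.
g_loss = -np.mean(np.log(discriminator(fake, w_d) + 1e-8))

print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

In a real implementation, gradient steps on the two losses would alternate: the discriminator improves at telling batches apart, which in turn forces the generator to produce more realistic samples.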
The Synthetic Data Community maintains a GitHub repository with material on GANs for synthetic data generation; for questions on the subject, you can join their Slack channel. The community has also developed tutorials to get you started. Two of them are discussed below:
The Wasserstein GAN (WGAN) is an extension of the original GAN, introduced by Martin Arjovsky and colleagues in 2017. WGAN improves stability when training the model and introduces a loss function that correlates with the quality of the generated samples.
Instead of a discriminator that classifies a generated sample as real or fake, WGAN uses a ‘critic’ that scores how real or fake a sample looks. In theory, training the generator should minimise the distance between the distribution of the training data and the distribution of the generated examples; WGAN approximates this with the Wasserstein (Earth Mover’s) distance.
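The critic idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions (a linear critic, random stand-in batches), not a working WGAN; it only shows how the critic's unbounded scores replace the discriminator's probabilities, and how the original WGAN paper's weight clipping is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    """Toy critic: a linear map producing an unbounded realness score
    (no sigmoid, unlike a classifying discriminator)."""
    return x @ w

w_c = rng.normal(size=(3,))
real = rng.normal(loc=1.0, size=(64, 3))
fake = rng.normal(loc=-1.0, size=(64, 3))  # stand-in for generator output

# Critic loss: widen the score gap between real and fake batches.
# Minimising this approximates the (negated) Wasserstein-1 distance.
critic_loss = np.mean(critic(fake, w_c)) - np.mean(critic(real, w_c))

# Generator loss: raise the critic's score on synthetic samples.
gen_loss = -np.mean(critic(fake, w_c))

# Weight clipping, as in the original WGAN, keeps the critic
# roughly 1-Lipschitz so the distance estimate stays valid.
w_c = np.clip(w_c, -0.01, 0.01)

print(f"critic loss: {critic_loss:.3f}")
```

Because the critic outputs a score rather than a saturating probability, its loss keeps providing a useful training signal even when the generator is far from the real distribution, which is a key reason for WGAN's stability.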
The changes introduced in WGAN bring several benefits when training generative networks. Training is more stable than with the originally proposed GAN, and less sensitive to the choice of model architecture and hyperparameters.
Generating tabular data is far simpler than generating data that must preserve its temporal characteristics. To generate time-series data successfully, the model must capture both the distribution of features within each time step and the complex dynamics of those features across time.
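Before a time-series GAN can learn those dynamics, the raw series is typically sliced into fixed-length overlapping windows. A minimal sketch of that preprocessing step, with hypothetical shapes (the `to_sequences` helper and its parameters are illustrative, not from any particular library):

```python
import numpy as np

def to_sequences(series, seq_len):
    """Slice a (T, n_features) array into overlapping windows of
    shape (T - seq_len + 1, seq_len, n_features)."""
    return np.stack([series[i:i + seq_len]
                     for i in range(len(series) - seq_len + 1)])

# Hypothetical example: 100 time steps of 5 features, windows of length 24.
data = np.random.default_rng(0).normal(size=(100, 5))
windows = to_sequences(data, seq_len=24)
print(windows.shape)  # (77, 24, 5)
```

Each window is then treated as one training sample, so the model sees both the per-step feature values and how they evolve over the 24 steps.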
TimeGAN was proposed in 2019 by Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. While other GAN architectures, including WGAN, rely on unsupervised adversarial learning alone, TimeGAN introduces a supervised loss: the model is encouraged to capture the conditional distribution over time within the data by using the original data for supervision.
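The supervised-loss idea can be illustrated with a toy one-step-ahead prediction in latent space. This is a heavily simplified sketch, not the TimeGAN implementation: the linear `supervisor` stands in for TimeGAN's supervisor network, and the latent sequences and shapes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def supervisor(h_prev, w):
    """Toy one-step-ahead predictor in latent space (stand-in for
    TimeGAN's supervisor network)."""
    return h_prev @ w

# Hypothetical latent sequences: batch of 32, 24 time steps, 8 latent dims.
h = rng.normal(size=(32, 24, 8))
w_s = rng.normal(size=(8, 8)) * 0.1

# Supervised loss: how well the model predicts the next latent step
# from the current one, using the real sequence itself as supervision.
h_next_pred = supervisor(h[:, :-1, :], w_s)
supervised_loss = np.mean((h[:, 1:, :] - h_next_pred) ** 2)

print(f"supervised loss: {supervised_loss:.3f}")
```

Minimising this term alongside the adversarial losses is what pushes the model to respect the stepwise temporal dynamics of the original data, rather than merely matching the overall distribution.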
What sets TimeGAN apart from other GAN architectures used for sequential data is its support for a mixed-data setting: static attributes and sequential features can be generated at the same time.
Like WGAN, TimeGAN also has a more stable training process than many other architectures and is less sensitive to hyperparameter changes.
Why Synthetic Data
First, the big players already have a stronghold on data and have created monopolies, or ‘data-opolies’. Synthetic data generation models can help address this power imbalance.
Secondly, the rising number of cyberattacks, especially after the pandemic, has raised privacy and security concerns. The situation is especially worrying when huge amounts of data are stored in one place. By creating synthetic data, organisations can mitigate this risk.
Thirdly, whenever datasets are created, they reflect real-world biases, resulting in the over-representation or under-representation of certain sections of society. Machine learning algorithms trained on such datasets amplify these biases, resulting in further discrimination. Synthetic data generation can fill in the gaps and help create more balanced datasets.
Lastly, generating synthetic data is cost-effective and saves a great deal of time.