Guide To Synthetic Data Vault: An Ecosystem Of Synthetic Data Generation Libraries


Synthetic Data Vault (SDV) is a collection of libraries for generating synthetic data for Machine Learning tasks. It enables modeling of tabular and time-series datasets that can then be used to synthesise new data resembling the original ones in terms of format and statistical properties. SDV was introduced by Neha Patki, Roy Wedge and Kalyan Veeramachaneni – researchers at CSAIL and LIDS, MIT (research paper).

A brief introduction to Synthetic Data Vault can also be found in one of our previous articles (weblink). This article covers the open-source project in a bit more detail, along with a practical example in Python.

Synthetic data generated using SDV can be used as additional training data for Machine Learning models (data augmentation). At times, it can even be used in place of the original data, since the two closely resemble each other in format and statistical properties. It also protects the privacy of the original data, i.e. the original records do not get disclosed to a user who sees only the synthetic version. SDV uses recursive sampling methods and hierarchical models for data generation in order to support a wide range of structures for storing the data.

The SDV project is still under development. The functionality it provides to date is summarized in the following section.

Key features of SDV

SDV provides synthetic data generators for creating new data from the following types of data:

  1. Data contained in a single table
  • SDV can handle missing data and multiple data types with minimal input from the user.
  • SDV can handle various types of data constraints and validations.
  • SDV can employ a GAN-based deep learning model (CTGAN) as well as the GaussianCopula model for such single-table data.
  2. Data spread across multiple tables and relational datasets
  • SDV uses the GaussianCopula model and recursive sampling methods to synthesize new data from multi-table data.
  • SDV maintains relational metadata for all the constituent tables along with a metadata schema.
  3. Time-series data
  • SDV uses various autoregressive, statistical and deep learning models to synthesize multivariate time-series data.
  • SDV allows sampling to be conditioned on attributes suited to the problem’s context.
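
To make the Gaussian copula idea behind the GaussianCopula model a little more concrete, here is a minimal two-column sketch in plain Python. It is not SDV's implementation: the helper names (fit_copula, sample_copula and so on), the use of empirical ranks for the marginals, and the toy 'ages' and 'incomes' columns are all assumptions made for illustration. It needs Python 3.8 or newer for statistics.NormalDist.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def to_normal_scores(values):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(xs, ys):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return max(-1.0, min(1.0, cov / math.sqrt(var_x * var_y)))


def empirical_quantile(sorted_values, u):
    """Return the observed value at probability level u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_copula(col_a, col_b):
    """Keep the marginals (as sorted samples) and the latent correlation."""
    rho = correlation(to_normal_scores(col_a), to_normal_scores(col_b))
    return {"a": sorted(col_a), "b": sorted(col_b), "rho": rho}


def sample_copula(model, num_rows, seed=0):
    """Draw correlated normal pairs and map them back to observed values."""
    rng = random.Random(seed)
    rho = model["rho"]
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(model["a"], u1),
                     empirical_quantile(model["b"], u2)))
    return rows


# Invented toy columns, used only to exercise the sketch.
ages = [22, 25, 31, 38, 44, 52, 60, 67]
incomes = [18, 30, 24, 41, 58, 47, 61, 75]
copula_model = fit_copula(ages, incomes)
synthetic_rows = sample_copula(copula_model, num_rows=5)
print(synthetic_rows)
```

Because this sketch samples the marginals by looking up observed values, it can only reproduce values already present in the input columns. A real model would typically fit a parametric distribution to each column instead.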

Practical implementation

Here’s a demonstration of the Hierarchical Modelling Algorithm (HMA) for generating new data using Synthetic Data Vault. HMA recursively scans a relational dataset and applies a tabular model to each table in it, allowing the model to learn relationships between the fields of all the tables. The code was run on Google Colab with Python 3.7.10 and sdv 0.9.0. A step-wise explanation of the code follows:

  1. Install the sdv library.

!pip install sdv 

  2. Import the load_demo() function, defined in the sdv.demo module, for loading the relational demo data.

from sdv import load_demo

IMPORTANT NOTE: On executing the above import statement, you may get an error such as the following:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Uninstalling the NumPy library and reinstalling it resolves this error.

 !pip uninstall -y numpy    # uninstall NumPy (-y skips the confirmation prompt)
 !pip install numpy         # install it again

Now, import the load_demo() function again and the execution should succeed.

from sdv import load_demo
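
This kind of error typically indicates that a compiled package was built against a different NumPy version than the one currently installed, so it can help to check which versions are actually present. The helper below is a small diagnostic sketch, not part of SDV: the function name installed_version is made up here, and importlib.metadata requires Python 3.8 or newer. In a notebook, the runtime may also need a restart before a reinstalled NumPy is picked up.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages involved in the import above.
for name in ("numpy", "pandas", "sdv"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```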

  3. Load the relational data.

md, tb = load_demo(metadata=True)

load_demo() with the ‘metadata’ parameter set to ‘True’ returns a tuple consisting of an instance holding the metadata of the dataset and a dictionary holding the data tables, loaded as Pandas DataFrames.

 md, tb = load_demo(metadata=True)
 # 'md' contains the metadata, 'tb' contains the tables
  4. Display the metadata and the dictionary of tables.

md


tb


  5. Visualize the metadata to see the relationships between the data tables.

md.visualize()

Output:

The output figure shows the primary and foreign keys of each of the three tables. It also shows the parent-child relationships among them, i.e. ‘transactions’ is the child of the ‘sessions’ table, which in turn is the child of the ‘users’ table.
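
The parent-child structure described above can be written down as a small mapping, which also makes it easy to derive a children-first processing order. This is only an illustrative sketch in plain Python. The names PARENT_OF, depth and children_first_order are assumptions and are not part of the SDV metadata API.

```python
# Child table -> parent table, as described for the demo dataset.
PARENT_OF = {"sessions": "users", "transactions": "sessions"}


def children_of(table):
    """List the direct child tables of a table."""
    return sorted(child for child, parent in PARENT_OF.items() if parent == table)


def depth(table):
    """Number of parent links between a table and the root of the hierarchy."""
    steps = 0
    while table in PARENT_OF:
        table = PARENT_OF[table]
        steps += 1
    return steps


def children_first_order():
    """Order the tables so that every child comes before its parent."""
    tables = set(PARENT_OF) | set(PARENT_OF.values())
    return sorted(tables, key=depth, reverse=True)


print(children_of("users"))     # ['sessions']
print(children_first_order())   # ['transactions', 'sessions', 'users']
```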

  6. Import the HMA1 class from the sdv.relational module.

from sdv.relational import HMA1

  7. Instantiate the HMA1 class by passing the metadata of the relational dataset as a parameter.

hma_model = HMA1(md)

  8. Fit the HMA model to the relational tables’ data.

hma_model.fit(tb)

NOTE: When the fit() method is executed, SDV scans all the tables in the order of the relationships among them. It learns every child table using the GaussianCopula model. Before learning a parent table, SDV augments it with the copula parameters of its child table(s) so that the model can learn how the rows of the parent table relate to those of its child table(s).
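
The augmentation step in the note above can be pictured with a deliberately simplified sketch. SDV attaches copula parameters of the child rows to each parent row. Here, as a stand-in, each parent row only receives the number of child rows and the mean of one numeric child column. The function name extend_parent, the extension column names, the 'duration' column and the toy rows are all assumptions made for illustration.

```python
from statistics import mean

# Invented toy rows; the column names are illustrative.
users = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]
sessions = [
    {"session_id": 10, "user_id": 1, "duration": 30.0},
    {"session_id": 11, "user_id": 1, "duration": 50.0},
    {"session_id": 12, "user_id": 2, "duration": 20.0},
]


def extend_parent(parent_rows, child_rows, key, child_name, numeric_column):
    """Return copies of the parent rows with summaries of their child rows attached."""
    extended = []
    for parent in parent_rows:
        matches = [c[numeric_column] for c in child_rows if c[key] == parent[key]]
        row = dict(parent)
        row[f"__{child_name}__num_rows"] = len(matches)
        # None marks a parent that has no child rows
        row[f"__{child_name}__{numeric_column}__mean"] = mean(matches) if matches else None
        extended.append(row)
    return extended


extended_users = extend_parent(users, sessions, "user_id", "sessions", "duration")
for row in extended_users:
    print(row)
```

A model fitted on such an extended parent table sees, for every parent row, a summary of the children that belong to it, which is what lets it later generate children consistent with each sampled parent.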

  9. Synthesize new data from the HMA model.

synth_data = hma_model.sample()

A new dictionary of tables resembling those in the original ‘tb’ dictionary will be created. The tables will contain new data that is similar in format and statistical properties to that of the original relational tables.
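
One property worth checking in synthesized relational data is that every foreign key in a child table still points at an existing parent row. The sketch below does this in plain Python on lists of dictionaries. The function name orphan_keys and the toy rows are assumptions for illustration. With Pandas DataFrames, the rows could be obtained with DataFrame.to_dict('records') first.

```python
def orphan_keys(child_rows, parent_rows, foreign_key, primary_key):
    """Return foreign-key values in the child rows that match no parent row."""
    parent_ids = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parent_ids)


# Invented toy rows standing in for two synthesized tables.
synthetic_users = [{"user_id": 0}, {"user_id": 1}]
synthetic_sessions = [
    {"session_id": 0, "user_id": 0},
    {"session_id": 1, "user_id": 1},
    {"session_id": 2, "user_id": 1},
]

print(orphan_keys(synthetic_sessions, synthetic_users, "user_id", "user_id"))  # []
```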

 #Display the new synthesized data.
 synth_data 


  10. Save the HMA model.

hma_model.save('hma_mod.pkl')

Load the saved model.

load_model = HMA1.load('hma_mod.pkl')

NOTE: The saved model’s file ‘hma_mod.pkl’ will typically be much smaller than the original data. This is because the saved HMA model contains no information about the original data other than the parameters it needs to create a new version of it. Thus, the saved model’s file can be shared without disclosing the original records.
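
The size difference mentioned in the note can be illustrated with Python's pickle module and a toy "model" that stores only two fitted parameters. This is a sketch under simplifying assumptions, not a reproduction of what HMA1.save() writes. The toy_model dictionary and the generated column are invented for the example.

```python
import pickle
import random
from statistics import mean, stdev

rng = random.Random(0)
# Invented "original" data: 10,000 values of a single numeric column.
original_column = [rng.gauss(50, 10) for _ in range(10_000)]

# A stand-in "model": only two fitted parameters, none of the original rows.
toy_model = {"mean": mean(original_column), "stdev": stdev(original_column)}

model_bytes = pickle.dumps(toy_model)
data_bytes = pickle.dumps(original_column)
print(len(model_bytes) < len(data_bytes))  # True

# The restored parameters are enough to generate new values.
restored = pickle.loads(model_bytes)
new_column = [rng.gauss(restored["mean"], restored["stdev"]) for _ in range(5)]
print(new_column)
```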

  11. Sample new synthetic data from the ‘load_model’ instance.

newdt = load_model.sample()

  12. Display the synthetic data.

newdt


  13. In step (11), the number of rows to be sampled was not specified, so the synthetic data contains as many rows as the original tables. However, a custom number of rows can be specified as follows:

hma_model.sample('sessions', num_rows=6)

Since the above line of code specifies ‘sessions’, the synthetic data will only contain the ‘sessions’ table and its child, i.e. the ‘transactions’ table.


  14. By setting the ‘sample_children’ parameter of the sample() method to False, a parent table can be sampled on its own, without its child tables.

hma_model.sample('users', num_rows=6, sample_children=False)
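
The behaviour of the last two steps (sampling from a chosen table downwards, and optionally skipping the children) can be mimicked with a short recursive sketch. It is not SDV's sampling code. The function below is a hypothetical stand-in that only creates placeholder rows, and a random child count between 1 and 3 replaces what a fitted model would decide.

```python
import random

# Parent table -> child tables, following the demo dataset's hierarchy.
CHILDREN = {"users": ["sessions"], "sessions": ["transactions"], "transactions": []}


def sample(table, num_rows, sample_children=True, rng=None, parent_id=None, out=None):
    """Create num_rows placeholder rows for a table, optionally with descendants."""
    if rng is None:
        rng = random.Random(0)
    if out is None:
        out = {}
    rows = out.setdefault(table, [])
    for _ in range(num_rows):
        row = {"id": len(rows), "parent_id": parent_id}
        rows.append(row)
        if sample_children:
            for child in CHILDREN[table]:
                # A fitted model would decide how many child rows to create;
                # a random count between 1 and 3 stands in for that here.
                sample(child, rng.randint(1, 3), True, rng, row["id"], out)
    return out


sessions_and_children = sample("sessions", num_rows=6)
users_only = sample("users", num_rows=6, sample_children=False)
print(sorted(sessions_and_children))  # ['sessions', 'transactions']
print(sorted(users_only))             # ['users']
```

Starting at ‘sessions’ never produces a ‘users’ table because the recursion only walks downwards, from a table to its children.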

