Guide To Synthetic Data Vault: An Ecosystem Of Synthetic Data Generation Libraries

Synthetic Data Vault (SDV) is a collection of libraries for generating synthetic data for Machine Learning tasks.

It enables modeling of tabular, relational and time-series datasets, which can then be used to synthesize new data resembling the original in format and statistical properties. SDV was introduced by Neha Patki, Roy Wedge and Kalyan Veeramachaneni, researchers at CSAIL and LIDS, MIT (research paper).

A brief introduction to Synthetic Data Vault can also be found in one of our previous articles (weblink). This article covers the open-source project in more detail, along with a practical example using Python code.

Synthetic data generated using SDV can be used as additional information while training Machine Learning models (data augmentation). At times, it can even be used in place of the original data, since the two remain statistically similar. It also preserves data privacy: the original data is not disclosed to a user who sees only its synthetic version. SDV uses recursive sampling methods and hierarchical models for data generation, allowing a wide range of structures for storing the synthetic data.



The SDV project is still under active development. The functionality it provides to date is summarized in the following section.

Highlights of SDV's features

SDV provides synthetic data generators for creating new data from the following types of data:

  1. Data contained in a single table
  • SDV can handle missing data and multiple types of data with a minimal amount of input required.
  • SDV can handle various types of data constraints and validations.
  • SDV can employ a GAN-based deep learning model (CTGAN) as well as the GaussianCopula model for handling such single-table data.
  2. Data spread across multiple tables and relational datasets
  • SDV uses the GaussianCopula model and recursive sampling methods to synthesize new data from multi-table, relational datasets.
  • SDV maintains relational metadata of all the constituent tables of the data along with a metadata schema.
  3. Time-series data
  • SDV uses various autoregressive, statistical and deep learning models for data synthesis from multivariate data comprising a time series.
  • SDV allows data sampling to be conditioned on attributes suited to the problem’s context.

Practical implementation

Here’s a demonstration of the Hierarchical Modelling Algorithm (HMA) for generating new data using Synthetic Data Vault. HMA recursively scans a relational dataset and applies a tabular model to each table, allowing the model to learn the relationships between fields across all the tables. The code has been implemented in Google Colab with Python 3.7.10 and sdv 0.9.0. A step-wise explanation of the code follows:

  1. Install the sdv library.

!pip install sdv 

  2. Import the load_demo() method of the sdv.demo module for loading the relational demo data. 

from sdv import load_demo

NOTE: On executing the above import statement, you may get an error like the following:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Uninstalling the NumPy library and reinstalling it resolves this error.

 !pip uninstall numpy    #uninstall NumPy
 !pip install numpy       #install it again 

Now, import the load_demo() method again and the execution will be successful.

from sdv import load_demo

  3. Load the relational data.

md, tb = load_demo(metadata=True)
#'md' contains metadata, 'tb' contains the tables

load_demo() with the ‘metadata’ parameter set to True returns a tuple consisting of a Metadata instance describing the dataset and a dictionary of data tables loaded as Pandas DataFrames.
  4. Display the metadata and the dictionary of tables.





  5. Visualize the metadata in order to know the relationships between the data tables.



The output figure shows the primary key and foreign key for each of the three tables. It also shows the parent-child relationships among them, i.e. ‘transactions’ is the child of the ‘sessions’ table, which in turn is the child of the ‘users’ table.

  6. Import the HMA1 class of the sdv.relational module.

from sdv.relational import HMA1

  7. Instantiate the HMA1 class by passing the metadata of the relational dataset as a parameter.

hma_model = HMA1(md)

  8. Fit the HMA model to the relational tables’ data.

NOTE: When the fit() method is executed, SDV scans all the tables in the order of the relationships among them. It learns every child table using the GaussianCopula model. Before learning a parent table, SDV augments it with the copula parameters of its child table(s) so that the model can learn the relationships between the rows of the parent table and those of its child table(s).

  9. Synthesize new data from the HMA model.

synth_data = hma_model.sample()

A new dictionary of tables resembling those in the original ‘tb’ dictionary will be created. The tables will contain new data statistically similar to that of the original relational tables.

 #Display the new synthesized data.


  10. Save the HMA model.

hma_model.save('hma_mod.pkl')

Load the saved model.

load_model = HMA1.load('hma_mod.pkl')

NOTE: The saved model file ‘hma_mod.pkl’ will be smaller than the original data file. The reason is that the HMA model does not hold any of the original data; it only stores the copula parameters used to create a new version of the data. Thus, the saved model file can be shared without revealing any details of the original data.

  11. Sample newly synthesized data from the ‘load_model’ instance.

newdt = load_model.sample()

  12. Display the synthetic data.



  13. In step 11, the number of rows to be sampled was not specified, so the synthetic data contains as many rows as the original tables. However, a custom number of rows can be specified as follows:

hma_model.sample('sessions', num_rows=6)

Since the above line of code specifies ‘sessions’, the synthetic data will only contain the ‘sessions’ table and its child, the ‘transactions’ table.


  14. By setting the ‘sample_children’ parameter of the sample() method to False, a parent table alone can be used to synthesize data.

hma_model.sample('users', num_rows=6, sample_children=False)



Nikita Shiledarbaxi
A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.
