Featuretools is an open-source Python library for automated feature engineering, developed by Feature Labs. It enables the creation of new features from several related data tables. Feature selection techniques can then be used to choose the most appropriate features, after which data scientists can proceed with model building.
Before moving to the practical implementation of Featuretools, let us have a quick overview of some essential concepts for performing automatic feature engineering using Featuretools.
What is an entity and an entityset?
An entity is simply a table of data – a Pandas dataframe in Python code. Observations are recorded as rows while columns denote different features. An entityset is a collection of related entities. The creation of new features becomes easier with an entityset since it holds multiple tables and the relationships between them – all in one place. Entities and entitysets are abstractions independent of the underlying data, so they can be applied to any dataset.
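Here is a minimal sketch of building an entityset from two toy dataframes (the table and column names are invented for this illustration, and the code uses the pre-1.0 Featuretools API matching the rest of this article):

import pandas as pd
import featuretools as ft

# Parent table: one row per customer
customers = pd.DataFrame({'customer_id': [1, 2], 'join_year': [2019, 2020]})

# Child table: many rows per customer
transactions = pd.DataFrame({'transaction_id': [10, 11, 12],
                             'customer_id': [1, 1, 2],
                             'amount': [25.0, 40.0, 10.0]})

# Collect both tables in one entityset
es_toy = ft.EntitySet(id = 'toy_data')
es_toy = es_toy.entity_from_dataframe(entity_id = 'customers', dataframe = customers, index = 'customer_id')
es_toy = es_toy.entity_from_dataframe(entity_id = 'transactions', dataframe = transactions, index = 'transaction_id')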
How does Featuretools deal with variable types?
Featuretools can infer the types of variables on its own. However, in cases such as a Boolean variable whose values are stored as integers (0 or 1), we need to explicitly identify and assign the datatype, such as variable_types.Boolean, as sketched below. Visit this page to learn more about the variable types that Featuretools deals with.
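For instance, a short sketch of overriding the inferred type (the 'members' table and 'is_member' column are made up for this example):

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

# 'is_member' stores 0/1 integers, so Featuretools would otherwise infer a numeric type
members = pd.DataFrame({'customer_id': [1, 2], 'is_member': [1, 0]})

es_types = ft.EntitySet(id = 'toy_types')
es_types = es_types.entity_from_dataframe(entity_id = 'members', dataframe = members, index = 'customer_id', variable_types = {'is_member': vtypes.Boolean})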
What are relationships among entities?
Relationships are not a distinguishing feature of Featuretools; it is the same abstract concept used in relational database management systems (RDBMS). A relationship can be of various types such as one-to-one, one-to-many, many-to-one and many-to-many. A parent-child relationship among tables is an example of a one-to-many relationship, in which a single record of the parent table can be related to multiple records of the child table. Declaring such a relationship is sketched below.
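Continuing the toy entityset sketched earlier, declaring and adding the one-to-many link between customers and their transactions could look like this:

# Each customer (parent) can have many transactions (child);
# the relationship is declared on the shared key 'customer_id'
r_toy = ft.Relationship(es_toy['customers']['customer_id'], es_toy['transactions']['customer_id'])
es_toy = es_toy.add_relationship(r_toy)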
What is a feature primitive?
An operation applied to a dataframe for creating new features is termed a ‘feature primitive’. It involves simple computations which can be combined to create complex features. The two major feature primitives used in the practical implementation are aggregation and transformation (a toy illustration follows the list below).
Aggregation: It groups the child records of each parent record and computes statistics such as minimum, maximum, mean and standard deviation across them.
Transformation: It is an operation performed on one or more columns of a single table, e.g. computing the difference between two columns’ values.
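As a toy illustration of both primitive types, the mock customer entityset that ships with Featuretools can be used (the choice of primitives here is arbitrary):

import featuretools as ft

# Built-in demo entityset with customers, sessions, transactions and products tables
es_mock = ft.demo.load_mock_customer(return_entityset = True)

# 'mean' is an aggregation primitive (summarises child rows per parent record);
# 'month' is a transform primitive (derives a new column from a datetime column)
feature_matrix, feature_defs = ft.dfs(entityset = es_mock, target_entity = 'sessions', agg_primitives = ['mean'], trans_primitives = ['month'], max_depth = 1)
print(feature_matrix.columns.tolist())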
Refer to this page for detailed information on feature primitives.
What is Deep Feature Synthesis (DFS)?
DFS is the method used by Featuretools for creating new features. To perform DFS, the dfs() function of the featuretools library is used. It takes as input an entityset, a target entity (the table for which the new features are built), the aggregation and transformation primitives to be used for feature creation, and other parameters. Setting the ‘features_only’ parameter of dfs() to True creates only the feature definitions without computing their actual values (known as the feature matrix), as sketched below.
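Reusing the mock entityset from the previous sketch, the two-stage pattern could look like this: generate only the feature definitions first, then compute their values with calculate_feature_matrix() when needed:

# Build only the feature definitions; no values are computed yet
feature_defs = ft.dfs(entityset = es_mock, target_entity = 'sessions', agg_primitives = ['mean', 'max'], trans_primitives = ['month'], features_only = True)

# Compute the feature matrix for those definitions
feature_matrix = ft.calculate_feature_matrix(features = feature_defs, entityset = es_mock)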
Practical implementation
Here’s a demonstration of implementing automated feature engineering using Featuretools for a supervised machine learning classification task that aims to predict whether or not a client of the financial institution ‘Home Credit’ will default on a loan (‘default’ means the client fails to repay the loan). The ‘Home Credit Default Risk’ dataset used here is available for download on Kaggle. The code has been implemented using Python 3.7.10 and featuretools 0.23.2. A step-wise explanation of the code follows:
- Install Featuretools library
!pip install featuretools
- Import required libraries and modules
import numpy as np
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
- Read the data files
# Files containing training and test data for each client
app_train = pd.read_csv('application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('application_test.csv').replace({365243: np.nan})

# File containing data of clients' previous credits from financial institutions other than Home Credit
bureau = pd.read_csv('bureau.csv').replace({365243: np.nan})

# File containing monthly balance data for those credits
bureau_balance = pd.read_csv('bureau_balance.csv').replace({365243: np.nan})

# File containing monthly data about previous point-of-sale and cash loans
cash = pd.read_csv('POS_CASH_balance.csv').replace({365243: np.nan})

# File containing data about previous credit card loans
credit = pd.read_csv('credit_card_balance.csv').replace({365243: np.nan})

# File containing data of previous loan applications at Home Credit
previous = pd.read_csv('previous_application.csv').replace({365243: np.nan})

# File containing payment history for Home Credit's previous loans
installments = pd.read_csv('installments_payments.csv').replace({365243: np.nan})
- Join the training set and test set so that the same features can be built for both
# Create a target column in the test set before merging
app_test['TARGET'] = np.nan

# Append the test set to the training set
app = app_train.append(app_test, ignore_index = True, sort = True)
- Convert floating point indexes to integer type for adding relationships
# For each floating point index
for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    # For each of the dataframes
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        # Convert if the index is one of the columns of the dataframe
        if index in list(dataset.columns):
            # Fill null records with 0 and change datatype to integer
            dataset[index] = dataset[index].fillna(0).astype(np.int64)
- Identify the Boolean variables, which are stored as floats taking only the values 0.0 or 1.0
# Create a dictionary to specify Boolean types for the 'app' data
app_types = {}

# For each column in the dataset
for col in app:
    # If the column has 2 unique values and is of float type
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        # Assign the type as Boolean
        app_types[col] = vtypes.Boolean

# Remove TARGET from 'app_types' since it is the label, not a feature
del app_types['TARGET']

# Display the number of Boolean variables
print('There are {} Boolean variables in the application data.'.format(len(app_types)))
Output: There are 32 Boolean variables in the application data.
- Assign the ‘Ordinal’ datatype to the columns of the ‘app’ data which hold ordered discrete values
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal
- As done for the ‘app’ data, identify the Boolean variables of the ‘previous’ data (loaded in the data-reading step above)
# Create a dictionary to specify the datatypes
previous_types = {}

# For each column in the 'previous' data
for col in previous:
    # If the column has 2 unique values (0.0 or 1.0) and is of float type
    if (previous[col].nunique() == 2) and (previous[col].dtype == float):
        # Assign the datatype as Boolean
        previous_types[col] = vtypes.Boolean

# Display the number of Boolean variables in the 'previous' data
print('There are {} Boolean variables in the previous data.'.format(len(previous_types)))
Output: There are 1 Boolean variables in the previous data.
- The ‘credit’, ‘cash’ and ‘installments’ dataframes contain the SK_ID_CURR variable. It is not needed in these tables since we will link them to the ‘app’ data through the ‘previous’ data using the SK_ID_PREV variable.
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])
- Create an entityset and add the seven data tables to it. The entity_from_dataframe() method loads the data for a specific entity from a specified dataframe.
# Create an empty entityset first (the id is just a label for it)
es = ft.EntitySet(id = 'clients')

es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types = app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types = previous_types)

# For entities which do not have a unique index, create one by setting 'make_index' to True
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, make_index = True, index = 'bureaubalance_index')
es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, make_index = True, index = 'cash_index')
es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments, make_index = True, index = 'installments_index')
es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit, make_index = True, index = 'credit_index')

# Display the entityset
es
Output:
- Identify relationships: e.g. the ‘app’ dataframe has a single record for each client, identified by the key SK_ID_CURR, while the ‘bureau’ dataframe has multiple records for each client. So the ‘app’ dataframe is the parent and the ‘bureau’ dataframe is the child in this parent-child relationship. Display their relationship:
print('Parent: app, Parent Variable of bureau: SK_ID_CURR\n\n', app.iloc[:, 111:115].head())
Output:
- The ‘bureau’ and ‘bureau_balance’ dataframes are linked through a shared variable called SK_ID_BUREAU. This variable is called the ‘parent variable’ in the parent table bureau and the ‘child variable’ in the child table bureau_balance.
print('Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\n\n', bureau.iloc[:, :5].head())
print('\nChild: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\n\n', bureau_balance.head())
Output:
- Define the relationships among the dataframes, to be added to the entityset in the next step. The Relationship class represents a relationship between two entities.
# app and bureau relation
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# bureau and bureau_balance relation
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# current app and previous app relation
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# cash, installments, and credit's relation with previous app
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
- Add the above-created relationships to the entityset:
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])

# Print the modified entityset
es
Output:
- Visualize the entityset.
es.plot()
Sample condensed output:
- List the available feature primitives
primitives = ft.list_primitives()

# Set maximum column width for displaying the primitives
pd.options.display.max_colwidth = 100

# Display the records with aggregation primitives
primitives[primitives['type'] == 'aggregation'].head(10)
Output:
Display the records with transform primitives.
primitives[primitives['type'] == 'transform'].head(10)
Output:

- Build new features using the default primitives of featuretools.
Specify the default aggregation and transformation primitives.
# Aggregation primitives
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]

# Transformation primitives
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]
Create the new feature definitions from the entityset using the dfs() function:
feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       trans_primitives = default_trans_primitives,
                       agg_primitives = default_agg_primitives,
                       where_primitives = [], seed_features = [],
                       max_depth = 2, n_jobs = -1, verbose = 1,
                       features_only = True)
Output: Built 2089 features
- Display some of the newly generated features
feature_names[1050:1070]
Output:
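The walkthrough stops at the feature definitions. Although not part of the original code, the corresponding feature matrix could then be computed from these definitions; note that this is memory- and compute-intensive on the full dataset:

# Compute the actual values of the generated feature definitions
feature_matrix = ft.calculate_feature_matrix(features = feature_names, entityset = es, verbose = 1)
feature_matrix.head()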
- Code source: GitHub
- Google Colab notebook of the above implementation
References
For an in-depth understanding of Featuretools, refer to the following sources:
- Official website
- Documentation
- GitHub repository
- Practical use cases of Featuretools with source code