Featuretools is an open-source Python library for automated feature engineering, developed by Feature Labs. It enables the creation of new features from several related data tables. Feature selection techniques can then be used to choose the most appropriate features, after which data scientists can proceed with model building.
Before moving to the practical implementation of Featuretools, let us have a quick overview of some essential concepts for performing automatic feature engineering using Featuretools.
What is an entity and an entityset?
An entity is simply a table of data – a Pandas dataframe in Python code. Observations are recorded as rows while columns denote different features. An entityset is a collection of related entities. The creation of new features becomes easier with an entityset since it holds multiple tables and the relationships between them – all in one place. Entities and entitysets are abstractions independent of the underlying data, so they can be applied to any dataset.
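Here is a minimal sketch of building an entityset from two toy dataframes (the table and column names are invented for this illustration, and the code uses the pre-1.0 Featuretools API matching the rest of this article):

import pandas as pd
import featuretools as ft

# Parent table: one row per customer
customers = pd.DataFrame({'customer_id': [1, 2], 'join_year': [2019, 2020]})

# Child table: many rows per customer
transactions = pd.DataFrame({'transaction_id': [10, 11, 12],
                             'customer_id': [1, 1, 2],
                             'amount': [25.0, 40.0, 10.0]})

# Collect both tables in one entityset
es_toy = ft.EntitySet(id = 'toy_data')
es_toy = es_toy.entity_from_dataframe(entity_id = 'customers', dataframe = customers, index = 'customer_id')
es_toy = es_toy.entity_from_dataframe(entity_id = 'transactions', dataframe = transactions, index = 'transaction_id')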
How does Featuretools deal with variable types?
Featuretools can infer the types of variables on its own. However, in cases such as a Boolean variable whose values are stored as integers (0 or 1), we need to explicitly identify and assign the datatype, such as variable_types.Boolean, as sketched below. Visit this page to learn more about the variable types that Featuretools deals with.
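For instance, a short sketch of overriding the inferred type (the 'members' table and 'is_member' column are made up for this example):

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

# 'is_member' stores 0/1 integers, so Featuretools would otherwise infer a numeric type
members = pd.DataFrame({'customer_id': [1, 2], 'is_member': [1, 0]})

es_types = ft.EntitySet(id = 'toy_types')
es_types = es_types.entity_from_dataframe(entity_id = 'members', dataframe = members, index = 'customer_id', variable_types = {'is_member': vtypes.Boolean})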
What are relationships among entities?
Relationships are not a distinguishing feature of Featuretools; it is the same abstract concept used in relational database management systems (RDBMS). A relationship can be of various types such as one-to-one, one-to-many, many-to-one and many-to-many. A parent-child relationship among tables is an example of a one-to-many relationship, in which a single record of the parent table can be related to multiple records of the child table. Declaring such a relationship is sketched below.
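Continuing the toy entityset sketched earlier, declaring and adding the one-to-many link between customers and their transactions could look like this:

# Each customer (parent) can have many transactions (child);
# the relationship is declared on the shared key 'customer_id'
r_toy = ft.Relationship(es_toy['customers']['customer_id'], es_toy['transactions']['customer_id'])
es_toy = es_toy.add_relationship(r_toy)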
What is a feature primitive?
An operation applied to a dataframe for creating new features is termed a ‘feature primitive’. It involves simple computations which can be combined to create complex features. The two major feature primitives used in the practical implementation are aggregation and transformation (a toy illustration follows the list below).
Aggregation: It groups the child records of each parent record and computes statistics such as minimum, maximum, mean and standard deviation across them.
Transformation: It is an operation performed on one or more columns of a single table, e.g. computing the difference between two columns’ values.
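As a toy illustration of both primitive types, the mock customer entityset that ships with Featuretools can be used (the choice of primitives here is arbitrary):

import featuretools as ft

# Built-in demo entityset with customers, sessions, transactions and products tables
es_mock = ft.demo.load_mock_customer(return_entityset = True)

# 'mean' is an aggregation primitive (summarises child rows per parent record);
# 'month' is a transform primitive (derives a new column from a datetime column)
feature_matrix, feature_defs = ft.dfs(entityset = es_mock, target_entity = 'sessions', agg_primitives = ['mean'], trans_primitives = ['month'], max_depth = 1)
print(feature_matrix.columns.tolist())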
Refer to this page for detailed information on feature primitives.
What is Deep Feature Synthesis (DFS)?
DFS is the method used by Featuretools for creating new features. To perform DFS, the dfs() function of the featuretools library is used. It takes as input an entityset, a target entity (the table for which the new features are built), the aggregation and transformation primitives to be used for feature creation, and other parameters. Setting the ‘features_only’ parameter of dfs() to True creates only the feature definitions without computing their actual values (known as the feature matrix), as sketched below.
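Reusing the mock entityset from the previous sketch, the two-stage pattern could look like this: generate only the feature definitions first, then compute their values with calculate_feature_matrix() when needed:

# Build only the feature definitions; no values are computed yet
feature_defs = ft.dfs(entityset = es_mock, target_entity = 'sessions', agg_primitives = ['mean', 'max'], trans_primitives = ['month'], features_only = True)

# Compute the feature matrix for those definitions
feature_matrix = ft.calculate_feature_matrix(features = feature_defs, entityset = es_mock)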
Practical implementation
Here’s a demonstration of implementing automated feature engineering using Featuretools for a supervised machine learning classification task that aims to predict whether or not a client of the financial institution ‘Home Credit’ will default on a loan (‘default’ means the client fails to repay the loan). The ‘Home Credit Default Risk’ dataset used here is available for download on Kaggle. The code has been implemented using Python 3.7.10 and featuretools 0.23.2. A step-wise explanation of the code follows:
- Install Featuretools library
!pip install featuretools
- Import required libraries and modules
import numpy as np
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
- Read the data files
# Files containing training and test data for each client
app_train = pd.read_csv('application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('application_test.csv').replace({365243: np.nan})

# File containing data of clients' previous credits from financial institutions other than Home Credit
bureau = pd.read_csv('bureau.csv').replace({365243: np.nan})

# File containing monthly balance data for those credits
bureau_balance = pd.read_csv('bureau_balance.csv').replace({365243: np.nan})

# File containing monthly data about previous point-of-sale and cash loans
cash = pd.read_csv('POS_CASH_balance.csv').replace({365243: np.nan})

# File containing data about previous credit card loans
credit = pd.read_csv('credit_card_balance.csv').replace({365243: np.nan})

# File containing data of previous loan applications at Home Credit
previous = pd.read_csv('previous_application.csv').replace({365243: np.nan})

# File containing payment history for Home Credit's previous loans
installments = pd.read_csv('installments_payments.csv').replace({365243: np.nan})
- Join the training set and test set so that the same features can be built for both
# Create a target column in the test set before merging
app_test['TARGET'] = np.nan

# Append the test set to the training set
app = app_train.append(app_test, ignore_index = True, sort = True)
- Convert floating point indexes to integer type for adding relationships
# For each floating point index
for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    # For each of the dataframes
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        # Convert if the index is one of the columns of the dataframe
        if index in list(dataset.columns):
            # Fill null records with 0 and change datatype to integer
            dataset[index] = dataset[index].fillna(0).astype(np.int64)
- Identify the Boolean variables, which are stored as floats taking only the values 0.0 or 1.0
# Create a dictionary to specify Boolean types for the 'app' data
app_types = {}

# For each column in the dataset
for col in app:
    # If the column has 2 unique values and is of float type
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        # Assign the type as Boolean
        app_types[col] = vtypes.Boolean

# Remove TARGET from 'app_types' since it is the label, not a feature
del app_types['TARGET']

# Display the number of Boolean variables
print('There are {} Boolean variables in the application data.'.format(len(app_types)))
Output: There are 32 Boolean variables in the application data.
- Assign the ‘Ordinal’ datatype to the columns of the ‘app’ data which hold ordered discrete values
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal
- As done for the ‘app’ data, identify the Boolean variables of the ‘previous’ data (loaded in the data-reading step above)
# Create a dictionary to specify the datatypes
previous_types = {}

# For each column in the 'previous' data
for col in previous:
    # If the column has 2 unique values (0.0 or 1.0) and is of float type
    if (previous[col].nunique() == 2) and (previous[col].dtype == float):
        # Assign the datatype as Boolean
        previous_types[col] = vtypes.Boolean

# Display the number of Boolean variables in the 'previous' data
print('There are {} Boolean variables in the previous data.'.format(len(previous_types)))
Output: There are 1 Boolean variables in the previous data.
- The ‘credit’, ‘cash’ and ‘installments’ dataframes contain the SK_ID_CURR variable. It is not needed in these tables since we will link them to the ‘app’ data through the ‘previous’ data using the SK_ID_PREV variable.
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])
- Create an entityset and add the seven data tables to it. The entity_from_dataframe() method loads the data for a specific entity from a specified dataframe.
# Create an empty entityset first (the id is just a label for it)
es = ft.EntitySet(id = 'clients')

es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types = app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types = previous_types)

# For entities which do not have a unique index, create one by setting 'make_index' to True
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, make_index = True, index = 'bureaubalance_index')
es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, make_index = True, index = 'cash_index')
es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments, make_index = True, index = 'installments_index')
es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit, make_index = True, index = 'credit_index')

# Display the entityset
es
Output:
- Identify relationships: e.g. the ‘app’ dataframe has a single record for each client, identified by the key SK_ID_CURR, while the ‘bureau’ dataframe has multiple records for each client. So the ‘app’ dataframe is the parent and the ‘bureau’ dataframe is the child in this parent-child relationship. Display their relationship:
print('Parent: app, Parent Variable of bureau: SK_ID_CURR\n\n', app.iloc[:, 111:115].head())
Output:
- The ‘bureau’ and ‘bureau_balance’ dataframes are linked through a shared variable called SK_ID_BUREAU. This variable is called the ‘parent variable’ in the parent table bureau and the ‘child variable’ in the child table bureau_balance.
print('Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\n\n', bureau.iloc[:, :5].head())
print('\nChild: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\n\n', bureau_balance.head())
Output:
- Define the relationships among the dataframes, to be added to the entityset in the next step. The Relationship class represents a relationship between two entities.
# app and bureau relation
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# bureau and bureau_balance relation
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# current app and previous app relation
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# cash, installments, and credit's relation with previous app
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
- Add the above-created relationships to the entityset:
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])

# Print the modified entityset
es
Output:
- Visualize the entityset.
es.plot()
Sample condensed output:
- List the available feature primitives
primitives = ft.list_primitives()

# Set maximum column width for displaying the primitives
pd.options.display.max_colwidth = 100

# Display the records with aggregation primitives
primitives[primitives['type'] == 'aggregation'].head(10)
Output:
Display the records with transform primitives.
primitives[primitives['type'] == 'transform'].head(10)
Output:

- Build new features using the default primitives of featuretools.
Specify the default aggregation and transformation primitives.
# Aggregation primitives
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]

# Transformation primitives
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]
Create the new feature definitions from the entityset using the dfs() function:
feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       trans_primitives = default_trans_primitives,
                       agg_primitives = default_agg_primitives,
                       where_primitives = [], seed_features = [],
                       max_depth = 2, n_jobs = -1, verbose = 1,
                       features_only = True)
Output: Built 2089 features
- Display some of the newly generated features
feature_names[1050:1070]
Output:
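The walkthrough stops at the feature definitions. Although not part of the original code, the corresponding feature matrix could then be computed from these definitions; note that this is memory- and compute-intensive on the full dataset:

# Compute the actual values of the generated feature definitions
feature_matrix = ft.calculate_feature_matrix(features = feature_names, entityset = es, verbose = 1)
feature_matrix.head()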
- Code source: GitHub
- Google Colab notebook of the above implementation
References
For an in-depth understanding of Featuretools, refer to the following sources:
- Official website
- Documentation
- GitHub repository
- Practical use cases of Featuretools with source code