Introduction To Featuretools: A Python Framework For Automated Feature Engineering

Featuretools is an open-source Python library designed for automated feature engineering. It was developed by Feature Labs. It enables the creation of new features from several related data tables. Feature selection techniques can then be used to choose appropriate features, after which data scientists can proceed with model creation.

[Image: Featuretools workflow (source: GitHub)]

Before moving to the practical implementation of Featuretools, let us have a quick overview of some essential concepts for performing automatic feature engineering using Featuretools.

What is an entity and an entityset?

An entity is just a table of data, or a Pandas dataframe in Python code. Observations are recorded as rows, while columns denote features. An entityset is a collection of related entities. The creation of new features becomes easier using an entityset since it holds multiple tables and the relationships between them in one place. Entities and entitysets are independent of the underlying data, so these abstractions can be applied to any dataset.
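
For instance, here is a minimal sketch (with hypothetical ‘customers’ and ‘transactions’ tables) of building an entityset using the same 0.x API as this article:

import pandas as pd
import featuretools as ft

#A parent table with one row per customer
customers = pd.DataFrame({'customer_id': [1, 2], 'join_year': [2018, 2019]})
#A child table with many rows per customer
transactions = pd.DataFrame({'transaction_id': [10, 11, 12],
                             'customer_id': [1, 1, 2],
                             'amount': [25.0, 40.0, 10.0]})
#Create an entityset and load both tables as entities
es = ft.EntitySet(id = 'toy_data')
es = es.entity_from_dataframe(entity_id = 'customers', dataframe = customers, index = 'customer_id')
es = es.entity_from_dataframe(entity_id = 'transactions', dataframe = transactions, index = 'transaction_id')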

How does Featuretools deal with variable types?

Featuretools can infer the types of variables on its own. However, for cases such as Boolean variable’s values stored as integers(0 or1), we need to explicitly identify and assign the datatype such as variable_types.Boolean. Visit this page to know more about the variable types that Featuretools deals with.
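
As a hedged illustration, assuming the toy ‘customers’ table above also had a hypothetical ‘is_active’ column stored as 0/1 integers, the datatype can be forced by passing a variable_types dictionary when the entity is created:

import featuretools.variable_types as vtypes

#Tell Featuretools that 'is_active', stored as 0/1 integers, is actually Boolean
es = es.entity_from_dataframe(entity_id = 'customers', dataframe = customers, index = 'customer_id',
                              variable_types = {'is_active': vtypes.Boolean})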

What are relationships among entities?

The relationship is not a distinguishing feature of Featuretools; it is the same abstract concept used in relational database management systems (RDBMS). A relationship can be of various types such as one-to-one, one-to-many, many-to-one and many-to-many. A parent-child relationship among datasets is an example of a one-to-many relationship, in which a parent dataset can be related to multiple other datasets, each of which is called a child dataset.
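
Continuing the toy entityset sketched above, a one-to-many relationship is declared with the parent variable first and the child variable second, then added to the entityset:

#Each customer (parent) has many transactions (children)
r_customer_transactions = ft.Relationship(es['customers']['customer_id'],
                                          es['transactions']['customer_id'])
es = es.add_relationship(r_customer_transactions)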

What is a feature primitive?

An operation applied to a data frame for creating new features is termed a ‘feature primitive’. It involves simple computations which can be combined for the creation of complex features. The two major types of feature primitives used in the practical implementation below are aggregation and transformation.

Aggregation: It groups the children of a parent table and computes statistics such as minimum, maximum, mean and standard deviation across them.

Transformation: It is an operation performed on one or more columns of one table, e.g. computing the difference between two columns’ values.
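
For intuition, an aggregation primitive such as MEAN applied across the one-to-many relationship above computes roughly what a pandas groupby would; a sketch on the toy tables:

#What the feature MEAN(transactions.amount) computes for each customer, in plain pandas
mean_amount = transactions.groupby('customer_id')['amount'].mean()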

Refer to this page for detailed information on feature primitives.

What is Deep Feature Synthesis (DFS)?

DFS is the method Featuretools uses to create new features. To perform DFS, the dfs() function of the featuretools library is used. It takes as input an entityset, a target entity (where the new features will be stored), the aggregation and transformation primitives to be used for feature creation, and other parameters. Setting the ‘features_only’ parameter of dfs() to True creates only the feature definitions without computing their actual values (known as the feature matrix).
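
A minimal sketch on the toy entityset above, showing both modes of dfs():

#features_only = True returns just the feature definitions, without computing values
feature_defs = ft.dfs(entityset = es, target_entity = 'customers',
                      agg_primitives = ['mean', 'max'], trans_primitives = [],
                      max_depth = 2, features_only = True)
#By default, dfs() also computes the feature matrix and returns both
feature_matrix, feature_defs = ft.dfs(entityset = es, target_entity = 'customers',
                                      agg_primitives = ['mean', 'max'], trans_primitives = [],
                                      max_depth = 2)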

Practical implementation 

Here’s a demonstration of implementing automated feature engineering with Featuretools for a supervised machine learning classification task that aims to predict whether a client of the financial institution ‘Home Credit’ will default on a loan (‘default’ means the client fails to repay the loan). The ‘Home Credit Default Risk’ dataset used here is available on Kaggle. The code has been implemented using Python 3.7.10 and featuretools 0.23.2. A step-wise explanation of the code follows:

1. Install the Featuretools library

!pip install featuretools

2. Import required libraries and modules

import numpy as np
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
3. Read the data files

#Files containing training and test data for each client
app_train = pd.read_csv('application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('application_test.csv').replace({365243: np.nan})
#File containing data of clients' previous credits from financial institutions other than Home Credit
bureau = pd.read_csv('bureau.csv').replace({365243: np.nan})
#File containing data about monthly balances of the credits
bureau_balance = pd.read_csv('bureau_balance.csv').replace({365243: np.nan})
#File containing monthly data about previous point of sale or cash loans
cash = pd.read_csv('POS_CASH_balance.csv').replace({365243: np.nan})
#File containing data regarding previous credit card loans
credit = pd.read_csv('credit_card_balance.csv').replace({365243: np.nan})
#File having data of previous loan applications at Home Credit
previous = pd.read_csv('previous_application.csv').replace({365243: np.nan})
#File containing data about payment history for Home Credit's previous loans
installments = pd.read_csv('installments_payments.csv').replace({365243: np.nan})
4. Join the training set and test set so that the same features can be built for both

#Create a target column in the test set before merging
app_test['TARGET'] = np.nan
#Append the test set to the training set
app = app_train.append(app_test, ignore_index = True, sort = True)
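
Note: DataFrame.append() works on the older pandas versions that featuretools 0.23.2 targets, but it has since been removed from pandas (2.0 onwards); on newer versions the equivalent is:

app = pd.concat([app_train, app_test], ignore_index = True, sort = True)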
5. Convert floating point indexes to integer type for adding relationships

#For each floating point index
for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    #For each of the dataframes
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        #Convert if the index is one of the columns of the dataframe
        if index in list(dataset.columns):
            #Fill null records with 0 and change the datatype to integer
            dataset[index] = dataset[index].fillna(0).astype(np.int64)
6. Identify the Boolean variables which are recorded as integers (0.0 or 1.0)

#Create a dictionary to specify the Boolean type for 'app' data
app_types = {}
#For each column in the dataset
for col in app:
    #If the column has 2 unique values and is of numeric type
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        #Assign the type as Boolean
        app_types[col] = vtypes.Boolean
#Remove the TARGET entry from the 'app_types' dictionary since it is the label, not a feature
del app_types['TARGET']
#Display the number of Boolean variables
print('There are {} Boolean variables in the application data.'.format(len(app_types)))

Output: There are 32 Boolean variables in the application data.

7. Assign the ‘Ordinal’ datatype to the columns of ‘app’ data which can have ordered discrete values

app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal
8. As done for the ‘app’ data, identify the Boolean variables of the ‘previous’ data (read in step 3)

#Create a dictionary to specify the datatypes
previous_types = {}
#For each column in the 'previous' data
for col in previous:
    #If the column has 2 unique values (0.0 or 1.0) and its datatype is numeric
    if (previous[col].nunique() == 2) and (previous[col].dtype == float):
        #Assign the datatype as Boolean
        previous_types[col] = vtypes.Boolean
#Display the number of Boolean variables in the 'previous' data
print('There are {} Boolean variables in the previous data.'.format(len(previous_types)))

Output: There are 1 Boolean variables in the previous data.

9. The ‘credit’, ‘cash’ and ‘installments’ data contain the SK_ID_CURR variable. This variable is not needed in them since we will link these three dataframes to the ‘app’ data through the ‘previous’ data using the SK_ID_PREV variable.

installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])
10. Create an entityset and add the seven data tables to it. The entity_from_dataframe() method loads the data for a specific entity from a specified dataframe.

#Create an empty entityset (the id string is arbitrary)
es = ft.EntitySet(id = 'clients')
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types = app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types = previous_types)
#For entities which do not have a unique index, create one by setting 'make_index' to True
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, make_index = True, index = 'bureaubalance_index')
es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, make_index = True, index = 'cash_index')
es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments, make_index = True, index = 'installments_index')
es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit, make_index = True, index = 'credit_index')
#Display the entityset
es

Output:

11. Identify relationships. For example, the ‘app’ dataframe has a single record for each client, identified by the key SK_ID_CURR, while the ‘bureau’ dataframe has multiple records for each client. So the ‘app’ dataframe is the parent and the ‘bureau’ dataframe is the child in the parent-child relationship between the tables. Display their relationship:
print('Parent: app, Parent Variable of bureau: SK_ID_CURR\n\n', app.iloc[:, 111:115].head())

Output:

12. The ‘bureau’ and ‘bureau_balance’ dataframes are linked through a shared variable, SK_ID_BUREAU. This variable is called the ‘parent variable’ in the parent table bureau and the ‘child variable’ in the child table bureau_balance.
print('Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\n\n', bureau.iloc[:, :5].head())
print('\nChild: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\n\n', bureau_balance.head()) 

Output:

13. Define the relationships among the dataframes to be added to the entityset. The Relationship class enables representing relationships between various entities.

#app and bureau relation
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])
#bureau and bureau_balance relation
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])
#current app and previous app relation
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])
#cash, installments, and credit's relations with previous app
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
14. Add the above-created relationships to the entityset:

es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])
#Print the modified entityset
es

Output:

15. Visualize the entityset.

es.plot()

Sample condensed output:

16. List the available feature primitives

primitives = ft.list_primitives()
#Set maximum column width for displaying the primitives
pd.options.display.max_colwidth = 100
#Display the records with the aggregation primitive type
primitives[primitives['type'] == 'aggregation'].head(10)

Output:

Display the records with the transform primitive type.

primitives[primitives['type'] == 'transform'].head(10)

Output:

17. Build new features using the default primitives of Featuretools.

Specify the default aggregation and transformation primitives.

#Aggregation primitives
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
#Transformation primitives
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]

Create the new feature definitions from the entityset using the dfs() method.

feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       trans_primitives = default_trans_primitives,
                       agg_primitives = default_agg_primitives,
                       where_primitives = [], seed_features = [],
                       max_depth = 2, n_jobs = -1, verbose = 1,
                       features_only = True)

Output: Built 2089 features

18. Display some of the newly generated features

feature_names[1050:1070]

Output:

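Since dfs() was called with features_only = True, only the feature definitions exist so far. To compute the actual feature matrix from them, featuretools provides calculate_feature_matrix(); a sketch of the call (note that computing all 2089 features on this dataset is memory- and time-intensive):

feature_matrix = ft.calculate_feature_matrix(feature_names, entityset = es, verbose = True)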
