Double debiased machine learning for evaluation and inference

Debiased ML combines bias correction and sample splitting to compute confidence intervals for scalar summaries of machine learning models.

When a dataset has a large number of independent variables, techniques such as feature selection and feature elimination are commonly used to keep only the most meaningful ones. In some circumstances, however, these techniques are likely to introduce a bias known as regularization, pre-test, or feature selection bias. This article focuses on how double/debiased machine learning removes this bias and supports causal inference. The following topics are covered.

Table of contents

  1. What is double/debiased ML?
  2. How does double/debiased ML work?
  3. Why is sample splitting important?
  4. Implementing the double/debiased ML with DoubleML

The double machine learning approach is a combination of orthogonalized machine learning and sample splitting. Let’s start with understanding double/debiased machine learning.

What is double/debiased ML?

Debiased machine learning is a meta method that uses bias correction and sample splitting to compute confidence intervals for machine learning functionals (i.e. scalar summaries).

The approach is backed by a nonasymptotic debiased machine learning theorem that applies to any global or local functional of any machine learning algorithm that satisfies a few basic, interpretable conditions. These conditions give users a simple way to translate modern learning theory rates into classical statistical inference.

What is the objective?

In highly complex settings, the double/debiased machine learning technique provides a straightforward and generic way to estimate, and perform inference on, a low-dimensional parameter of interest.

Here, high complexity means that the entropy of the parameter space of the nuisance parameter increases with the sample size. The parameter of interest is usually a causal or treatment effect parameter.

How does double/debiased ML work?

The main goal is to provide a general framework for estimation and inference on a low-dimensional parameter in the presence of a high-dimensional nuisance parameter, which may be estimated with modern nonparametric statistical (machine learning) methods.

  • The nuisance parameter is a secondary population parameter that must be accounted for in order to produce an estimated value for a primary parameter.

In a regression problem, a naive approach to estimating the parameter of interest with ML methods would be, for example, to construct a sophisticated ML estimator of the regression function and plug it in. Suppose, for the sake of clarity, that the data are randomly split into two equal parts (a main and an auxiliary sample). Sample splitting plays a major role in overcoming the bias.

The regression function is then learned on the auxiliary sample, and the parameter of interest is estimated on the main sample using that learned function. The estimator obtained from the main sample will generally converge more slowly than the usual 1/√n rate. The driving force behind this “inferior” behaviour is the bias in learning the regression function from the auxiliary sample.

This behaviour arises because certain terms in the estimation error have a non-zero mean. They have a non-zero mean because, in high-dimensional or otherwise highly complex settings, we must employ regularized estimators, such as lasso, ridge, boosting, or penalized neural nets, for informative learning to be feasible. Regularization keeps the estimator’s variance from exploding, but it also introduces substantial bias into the estimate. Specifically, the bias of the learned regression function typically converges, in the root mean squared error sense, at a rate n^(-φ) with φ < 1/2, which is slower than 1/√n. Hence the sum of the non-zero-mean terms, once scaled by √n, is of stochastic order √n · n^(-φ), which tends to infinity.
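The divergence can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `normalized_bias_order` and the exponent values are assumptions made for the example, where `phi` denotes the exponent in an assumed learner convergence rate n^(-phi).

```python
def normalized_bias_order(n, phi):
    """Order of sqrt(n) * n**(-phi): the bias of a learner that converges
    at rate n**(-phi), after scaling by sqrt(n)."""
    return n ** (0.5 - phi)

# phi = 0.25 is a hypothetical "slow" learner rate; phi = 0.5 is the parametric rate
for n in (10**2, 10**4, 10**6):
    slow = normalized_bias_order(n, phi=0.25)
    parametric = normalized_bias_order(n, phi=0.5)
    print(f"n={n:>7}: slow learner {slow:8.2f}, parametric rate {parametric:4.2f}")
```

With any exponent below 1/2 the scaled bias keeps growing with the sample size, whereas at exactly 1/2 it stays bounded.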

This bias is known as regularization bias. To mitigate it, the algorithm uses an “orthogonalized” formulation, obtained by partialling out the effect of the independent variables from the treatment variable to obtain an orthogonalized regressor.

  • Partialling is a statistical method for controlling the impact of a variable or collection of factors on other variables of interest (typically, the dependent and independent variables). Partialling aids in the clarification of a given connection by removing the influence of other variables that may be connected.

In particular, the orthogonalized regressor is the treatment variable minus an ML estimate of its conditional mean, and that estimate is derived from the auxiliary sample. In other words, an auxiliary prediction problem is solved to estimate the conditional mean of the treatment variable given the independent variables, in addition to the prediction problem for the outcome. Hence the name “double prediction” or “double machine learning.”

After partialling out and estimating the nuisance functions on the auxiliary sample, the “debiased” machine learning estimator of the parameter of interest is computed on the main sample of observations.
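The partialling-out step can be sketched on simulated data. This is a minimal illustration, not the DoubleML implementation: the data-generating process, the true effect `theta_true`, and the helper `fit_predict` are all assumptions made for this example, and polynomial least squares merely stands in for an ML learner.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
theta_true = 0.5  # hypothetical treatment effect used only for this simulation

# Simulated partially linear model: d = m(x) + v,  y = theta * d + g(x) + u
x = rng.uniform(-2.0, 2.0, size=n)
d = np.cos(x) + rng.normal(size=n)
y = theta_true * d + np.sin(x) + rng.normal(size=n)

# Random split into an auxiliary half and a main half
perm = rng.permutation(n)
aux, main = perm[: n // 2], perm[n // 2:]

def fit_predict(x_train, t_train, x_new, degree=5):
    """Stand-in learner (assumption): polynomial least squares."""
    coefs = np.polyfit(x_train, t_train, degree)
    return np.polyval(coefs, x_new)

# Two prediction problems, both learned on the auxiliary sample only
d_hat = fit_predict(x[aux], d[aux], x[main])  # estimate of E[d | x]
y_hat = fit_predict(x[aux], y[aux], x[main])  # estimate of E[y | x]

# Partial out x on the main sample, then regress residual on residual
v_hat = d[main] - d_hat
theta_hat = np.sum(v_hat * (y[main] - y_hat)) / np.sum(v_hat ** 2)
print(f"orthogonalized estimate: {theta_hat:.3f} (simulated truth: {theta_true})")
```

Only the main half contributes to the final regression here, which is the efficiency cost that swapping the roles of the two samples is meant to address.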

Why is sample splitting important?

The use of sample splitting plays a key role in establishing that remainder terms vanish in probability. The scaled estimation error of the debiased estimator can be decomposed into three components:

  • The leading term, which is asymptotically normal and drives the inference.
  • The second term, which captures the impact of regularization bias in estimating the nuisance functions.
  • The remainder term, which contains normalized sums of products of structural unobservables with the estimation errors of the ML learner used (for example lasso or a tree-based method).

The remainder term needs to be shown to vanish in probability for the estimator to be well behaved. The use of sample splitting allows simple and tight control of such terms.

To see this, assume that observations are independent and recall that the nuisance estimator is obtained using only observations in the auxiliary sample. Then, conditioning on the auxiliary sample, it is easy to verify that the remainder term has mean zero and variance tending to zero. Thus, the term vanishes in probability by Chebyshev’s inequality.
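For readers who want the argument in symbols, here is a sketch for the partially linear model. The notation is an assumption borrowed from the DML literature and is not defined in this article: Y is the outcome, D the treatment, X the covariates, g_0 and m_0 the nuisance functions, I the main sample of size n, and I^c the auxiliary sample.

```latex
% Assumed model: Y = D\theta_0 + g_0(X) + U, \qquad D = m_0(X) + V
\begin{align*}
\sqrt{n}\,(\check\theta_0 - \theta_0) &= a^* + b^* + c^*, \\
a^* &= (E V^2)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I} V_i U_i, \\
b^* &= (E V^2)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I}
       \bigl(\hat m_0(X_i) - m_0(X_i)\bigr)\bigl(\hat g_0(X_i) - g_0(X_i)\bigr).
\end{align*}

% A typical term inside c^*, with \hat g_0 fitted on the auxiliary sample I^c only:
\begin{align*}
c_1 &= \frac{1}{\sqrt{n}} \sum_{i \in I} V_i \bigl(\hat g_0(X_i) - g_0(X_i)\bigr), \\
E[c_1 \mid I^c] &= 0, \qquad
\operatorname{Var}(c_1 \mid I^c)
  = E\bigl[V^2 \bigl(\hat g_0(X) - g_0(X)\bigr)^2 \mid I^c\bigr], \\
P\bigl(|c_1| > \varepsilon \mid I^c\bigr)
  &\le \frac{\operatorname{Var}(c_1 \mid I^c)}{\varepsilon^2}.
\end{align*}
```

The conditional mean is zero because V has mean zero given X and the main-sample observations are independent of the auxiliary sample; the bound then goes to zero whenever the learner's mean squared error does.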

While sample splitting allows for the handling of remainder terms, its direct application has the disadvantage that the estimator of the parameter of interest only uses the main sample, which may result in a significant loss of efficiency because we are only using a part of the available data.

It may, however, be possible to reverse the roles of the main and auxiliary samples in order to generate a second version of the estimator of the parameter of interest. We can restore full efficiency by averaging the two resulting estimators. Because the two estimators will be roughly independent, simply averaging them is an efficient technique. This sample-splitting procedure, in which the roles of the main and auxiliary samples are swapped to obtain multiple estimates that are then averaged, is called cross-fitting.
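Cross-fitting as described above can be sketched in a few lines. This is an illustration under stated assumptions, not the DoubleML code: the function names `poly_predict`, `plr_on_fold` and `cross_fit`, the simulated data, and the polynomial stand-in learner are all hypothetical.

```python
import numpy as np

def poly_predict(x_train, t_train, x_new, degree=5):
    """Stand-in learner (assumption): polynomial least squares."""
    return np.polyval(np.polyfit(x_train, t_train, degree), x_new)

def plr_on_fold(x, d, y, train_idx, test_idx):
    """Learn nuisances on train_idx, then regress residual on residual on test_idx."""
    d_res = d[test_idx] - poly_predict(x[train_idx], d[train_idx], x[test_idx])
    y_res = y[test_idx] - poly_predict(x[train_idx], y[train_idx], x[test_idx])
    return np.sum(d_res * y_res) / np.sum(d_res ** 2)

def cross_fit(x, d, y, n_folds=2, seed=0):
    """Let every fold act once as the main sample and average the estimates."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    estimates = []
    for k in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        estimates.append(plr_on_fold(x, d, y, train_idx, folds[k]))
    return float(np.mean(estimates)), estimates

# Simulated data with a hypothetical true effect of 0.5
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-2.0, 2.0, size=n)
d = np.cos(x) + rng.normal(size=n)
y = 0.5 * d + np.sin(x) + rng.normal(size=n)

theta_cf, fold_estimates = cross_fit(x, d, y, n_folds=2)
print("per-fold estimates:", [round(float(e), 3) for e in fold_estimates])
print("cross-fitted estimate:", round(theta_cf, 3))
```

With `n_folds=2` this is exactly the role swap described above; larger values generalize it so that each observation is used for the final estimate once and for nuisance learning in the other folds.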

Implementing the double/debiased ML with DoubleML

This article uses regression data, so we will compare the results of DoubleML with a decision tree regressor, a random forest regressor and an XGBoost regressor as the learners.

The data set used in this article is the life expectancy data released by the WHO, covering 2000 to 2015. It has a total of 22 features, including the target variable, which is “Life expectancy”. The target variable contains the expected age of the population of different countries, based on different health care features.

DoubleML is a Python and R package that implements the double/debiased machine learning framework. The Python package is built on scikit-learn, whereas the R package is built on mlr3 and the mlr3 ecosystem. It simplifies the use of the double/debiased algorithm.

Installing Double ML

! pip install DoubleML

Import necessary libraries

import numpy as np
import pandas as pd
import doubleml as dml
 
from xgboost import XGBClassifier, XGBRegressor
from sklearn.ensemble import  RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
 
import matplotlib.pyplot as plt
import seaborn as sns

Reading and preprocessing the data

data=pd.read_csv('Life Expectancy Data.csv')
data.head()
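# drop rows with missing values; copy so the new column below is added to an independent frame
data_utils=data.dropna(axis=0).copy()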
 
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
data_utils['status_enc']=encoder.fit_transform(data_utils['Status'])

Initialize data backend

Let’s use the DoubleML package to estimate the average treatment effect of life expectancy (‘Life expectancy ’) on the adult mortality rate (‘Adult Mortality’).

features_base=['infant deaths','Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling', 'status_enc', 'Year']
 
data_dml_base = dml.DoubleMLData(data_utils,
                                 y_col='Adult Mortality',
                                 d_cols='Life expectancy ',
                                 x_cols=features_base)

‘DoubleMLData’ takes the data, the outcome column (here ‘Adult Mortality’), the treatment variable (here ‘Life expectancy ’) and the list of features to be used as independent variables.

Building Double ML model

This article only requires the partially linear regression (PLR) model, which is suitable when the treatment variable is continuous, as it is here.

DoubleML provides four models: partially linear regression models (PLR), partially linear IV regression models (PLIV), interactive regression models (IRM), and interactive IV regression models (IIVM).

trees = DecisionTreeRegressor(
    max_depth=30, ccp_alpha=0.0047, min_samples_split=203, min_samples_leaf=67)
 
np.random.seed(123)
dml_plr_tree = dml.DoubleMLPLR(data_dml_base,
                               ml_l = trees,
                               ml_m = trees,
                               n_folds = 3)
dml_plr_tree.fit(store_predictions=True)
tree_summary_plr = dml_plr_tree.summary
 
tree_summary_plr

This PLR model uses a decision tree regressor. Similarly, two more models are built: one with a random forest regressor and one with an XGBoost regressor.

randomForest = RandomForestRegressor(
    n_estimators=500, max_depth=7, max_features=3, min_samples_leaf=3)
np.random.seed(123)
dml_plr_forest = dml.DoubleMLPLR(data_dml_base,
                                 ml_l = randomForest,
                                 ml_m = randomForest,
                                 n_folds = 3)
dml_plr_forest.fit(store_predictions=True)
forest_summary_plr = dml_plr_forest.summary
 
forest_summary_plr
boost = XGBRegressor(n_jobs=1, objective = "reg:squarederror",
                     eta=0.1, n_estimators=35)
 
np.random.seed(123)
dml_plr_boost = dml.DoubleMLPLR(data_dml_base,
                                ml_l = boost,
                                ml_m = boost,
                                n_folds = 3)
dml_plr_boost.fit(store_predictions=True)
boost_summary_plr = dml_plr_boost.summary
boost_summary_plr

Comparing the results

final_summary = pd.concat(( forest_summary_plr, tree_summary_plr, boost_summary_plr))
final_summary.index = [ 'forest', 'tree', 'xgboost']
final_summary[['coef', '2.5 %', '97.5 %']]

In this comparison, the XGBoost regressor appears to perform best of the three learners. Overall, all three models identify a clear effect of life expectancy on adult mortality.
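The ‘2.5 %’ and ‘97.5 %’ columns are the bounds of a 95% confidence interval for the coefficient. As a rough sketch of how such bounds relate to a coefficient and its standard error, assuming the usual normal approximation, the helper below (a hypothetical name, `normal_ci`) computes them with the standard library. The numbers are made up for illustration and are not results from the models above.

```python
from statistics import NormalDist

def normal_ci(coef, std_err, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return coef - z * std_err, coef + z * std_err

# Hypothetical coefficient and standard error, for illustration only
lower, upper = normal_ci(coef=-1.0, std_err=0.25)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

When comparing learners, a narrower interval around a similar coefficient indicates a more precise estimate, and an interval that excludes zero indicates an effect that is statistically distinguishable from zero at that level.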

Conclusion

Double/debiased machine learning combines orthogonalization and sample splitting to produce valid statistical inference for the parameter of interest. In this article, we have covered how double/debiased machine learning works and how to implement it with DoubleML.

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
