When a data set has a large number of independent variables, feature selection and feature elimination are natural ways to keep only the most meaningful ones. In some circumstances, however, these steps are likely to introduce a bias known as regularization, pre-test, or feature selection bias. This article explains how double/debiased machine learning removes this bias so that valid causal inference on the parameter of interest remains possible. The following topics are covered.
Table of contents
- What is double/debiased ML?
- How does double/debiased ML work?
- Why is sample splitting important?
- Implementing the double/debiased ML with DoubleML
The double machine learning approach is a combination of orthogonalized machine learning and sample splitting. Let’s start with understanding double/debiased machine learning.
What is double/debiased ML?
Debiased machine learning is a meta method that uses bias correction and sample splitting to compute confidence intervals for machine learning functionals (i.e. scalar summaries).
The underlying theory provides a nonasymptotic debiased machine learning theorem that applies to any global or local functional of any machine learning algorithm that meets a few basic, interpretable conditions. These conditions give users a simple set of requirements for translating modern learning theory rates into classical statistical inference.
What is the objective?
In a highly complex setting, the double/debiased machine learning technique provides a straightforward and generic way to estimate, and draw inference on, a low-dimensional parameter of interest.
The high complexity indicates that the entropy of the nuisance parameter’s parameter space increases with sample size. The parameter of interest is usually a causal or treatment effect parameter.
How does double/debiased ML work?
The main goal is to provide a general framework for estimating, and doing inference about, a low-dimensional parameter in the presence of a high-dimensional nuisance parameter, which may be estimated with modern non-parametric statistical (machine learning) methods.
- The nuisance parameter is a secondary population parameter that must be accounted for in order to produce an estimated value for a primary parameter.
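One common way to make this setup concrete is the partially linear regression model used in the original double/debiased ML paper by Chernozhukov et al. The symbols below are notation assumed here for illustration; the article itself does not define them.

```latex
\begin{aligned}
Y &= D\,\theta_0 + g_0(X) + U, & \mathbb{E}[U \mid X, D] &= 0,\\
D &= m_0(X) + V,               & \mathbb{E}[V \mid X] &= 0.
\end{aligned}
```

Here Y is the outcome, D the treatment, and X the vector of independent variables. The scalar \theta_0 is the low-dimensional parameter of interest, while the unknown functions g_0 and m_0 together play the role of the high-dimensional nuisance parameter.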
In a regression problem, a naive approach to estimating the parameter of interest with ML methods would be, for example, to construct a sophisticated ML estimator of the regression function and plug it in. Suppose, for the sake of clarity, that the data is randomly split into two equal parts, a main sample and an auxiliary sample. Sample splitting plays a major role in overcoming the bias.
The nuisance function is then estimated on the auxiliary sample, and the parameter of interest is estimated on the main sample using that fit. The resulting naive estimator will generally converge more slowly than the usual 1/√n rate. The driving force behind this “inferior” behaviour is the bias in the nuisance function learned from the auxiliary sample.
This behaviour arises because certain terms in the estimation error have a non-zero mean. They have a non-zero mean because, in high-dimensional or otherwise highly complex settings, we must employ regularized estimators, such as lasso, ridge, boosting, or penalized neural nets, for informative learning to be feasible. Regularization keeps the estimator’s variance from ballooning, but it also introduces serious bias into the estimate. Specifically, the bias of the nuisance estimator typically converges, in the root mean squared error sense, at a rate n^(-φ) with φ < 1/2, which is slower than 1/√n. Hence, the sum of the terms with a non-zero mean, once scaled by √n, is of a stochastic order that tends to infinity.
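The effect of regularization on the naive estimator can be seen in a small simulation. The sketch below is only an illustration under assumptions that are not in the article: a made-up data-generating process with a true effect of 1, and a ridge penalty on the coefficient of X standing in for a regularized ML fit of the nuisance function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 4000, 1.0

# Hypothetical data-generating process (an assumption for illustration):
#   D = X + V,   Y = theta0 * D + 2 * X + U
X = rng.normal(size=n)
D = X + rng.normal(size=n)
Y = theta0 * D + 2.0 * X + rng.normal(size=n)

# Random split into an auxiliary half and a main half.
idx = rng.permutation(n)
aux, main = idx[: n // 2], idx[n // 2:]

# Auxiliary sample: regress Y on (D, X) with a ridge penalty on the X
# coefficient only, as a stand-in for a regularized ML estimate of g0.
lam = 0.2 * len(aux)
A = np.column_stack([D[aux], X[aux]])
penalty = np.diag([0.0, lam])
coef = np.linalg.solve(A.T @ A + penalty, A.T @ Y[aux])
g_hat = coef[1] * X[main]  # shrunken estimate of g0(X) = 2 * X

# Main sample: naive estimator that plugs in the biased g_hat.
theta_naive = np.sum(D[main] * (Y[main] - g_hat)) / np.sum(D[main] ** 2)
print(f"true theta = {theta0:.3f}, naive estimate = {theta_naive:.3f}")
```

In this setup the shrunken coefficient on X leaves part of the effect of X unexplained, and because D is correlated with X the naive estimate absorbs that leftover and lands noticeably above the true value.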
The bias that regularization introduces is known as regularization bias. To mitigate it, the algorithm uses an “orthogonalized” formulation, obtained by partialling out the effect of the independent variables from the treatment variable to get an orthogonalized regressor.
- Partialling is a statistical method for controlling the impact of a variable or collection of factors on other variables of interest (typically, the dependent and independent variables). Partialling aids in the clarification of a given connection by removing the influence of other variables that may be connected.
In particular, the orthogonalized regressor is the treatment variable minus an ML estimate of its conditional mean, and that estimate is fitted on the auxiliary sample. In other words, we use the independent variables to solve an auxiliary prediction problem that estimates the conditional mean of the treatment variable. Together with the prediction problem for the outcome, this makes the approach a “double prediction” or “double machine learning.”
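The partialling-out step can be sketched in a few lines of NumPy. This is a minimal illustration, not the DoubleML implementation: the data-generating process, the helper `ridge_slope`, and the penalty value are all assumptions made up for the example, with a one-dimensional ridge fit standing in for an ML learner.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 4000, 1.0

# Hypothetical data-generating process (an assumption for illustration).
X = rng.normal(size=n)
D = X + rng.normal(size=n)
Y = theta0 * D + 2.0 * X + rng.normal(size=n)

idx = rng.permutation(n)
aux, main = idx[: n // 2], idx[n // 2:]


def ridge_slope(x, target, lam):
    """One-dimensional ridge fit without intercept: slope of target on x."""
    return np.sum(x * target) / (np.sum(x ** 2) + lam)


lam = 0.2 * len(aux)  # deliberately heavy penalty, so both fits are biased

# First prediction problem: conditional mean of Y given X (auxiliary sample).
l_slope = ridge_slope(X[aux], Y[aux], lam)
# Second prediction problem: conditional mean of D given X (auxiliary sample).
m_slope = ridge_slope(X[aux], D[aux], lam)

# Partial X out of both the outcome and the treatment on the main sample.
Y_res = Y[main] - l_slope * X[main]
D_res = D[main] - m_slope * X[main]

# Orthogonalized estimator: regress the outcome residual on the treatment residual.
theta_dml = np.sum(D_res * Y_res) / np.sum(D_res ** 2)
print(f"nuisance slopes: {l_slope:.3f} (unshrunk 3), {m_slope:.3f} (unshrunk 1)")
print(f"true theta = {theta0:.3f}, orthogonalized estimate = {theta_dml:.3f}")
```

Both nuisance slopes are visibly shrunk by the penalty, yet the estimate stays fairly close to the true value, because the remaining bias depends on the product of the two nuisance errors rather than on either one alone.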
After partialling out and fitting the nuisance estimators on the auxiliary sample, the “debiased” machine learning estimator of the parameter of interest is built from the main sample of observations.
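For the partially linear model, the debiased estimator described above can be written out explicitly. The notation is assumed here and follows Chernozhukov et al.: Y is the outcome, D the treatment, X the independent variables, I the index set of the main sample with n observations, and \hat g_0 and \hat m_0 the ML estimates, fitted on the auxiliary sample, of the outcome's nuisance function and of the conditional mean of D given X.

```latex
\hat V_i = D_i - \hat m_0(X_i), \qquad
\check\theta_0
  = \Bigl(\frac{1}{n}\sum_{i \in I} \hat V_i D_i\Bigr)^{-1}
    \frac{1}{n}\sum_{i \in I} \hat V_i \bigl(Y_i - \hat g_0(X_i)\bigr).
```

The residual \hat V_i is the orthogonalized regressor, and it acts like an instrument for D_i in the final step.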
Why is sample splitting important?
The use of sample splitting plays a key role in establishing that remainder terms vanish in probability. The estimation error can be decomposed into three components:
- The leading term, which is asymptotically normal and determines the limiting distribution of the debiased machine learning estimator.
- The second term, which captures the impact of regularization bias in estimating the nuisance functions.
- The remainder term, which contains normalized sums of products of structural unobservables (the model's error terms) with the estimation errors of the learners used, such as lasso or tree-based methods.
The remainder term needs to be shown to vanish in probability for the estimator to be well behaved. The use of sample splitting allows simple and tight control of such terms.
To see this, assume that observations are independent and recall that the nuisance estimator is obtained using only observations in the auxiliary sample. Then, conditioning on the auxiliary sample, it is easy to verify that the remainder term has a mean of zero and a variance tending to zero. Thus, the term vanishes in probability by Chebyshev’s inequality.
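For the partially linear model, the decomposition and the Chebyshev argument can be sketched as follows. The notation is assumed here and follows Chernozhukov et al.: U and V are the error terms of the outcome and treatment equations, g_0 and m_0 the true nuisance functions, \hat g_0 and \hat m_0 their estimates from the auxiliary sample, and I the main sample of size n.

```latex
\begin{aligned}
\sqrt{n}\,(\check\theta_0 - \theta_0) &= a^* + b^* + c^*,\\
a^* &= (\mathbb{E} V^2)^{-1}\,\frac{1}{\sqrt{n}} \sum_{i \in I} V_i U_i,\\
b^* &= (\mathbb{E} V^2)^{-1}\,\frac{1}{\sqrt{n}} \sum_{i \in I}
       \bigl(\hat m_0(X_i) - m_0(X_i)\bigr)\bigl(\hat g_0(X_i) - g_0(X_i)\bigr),\\
c^* &\text{ contains terms such as }
       \frac{1}{\sqrt{n}} \sum_{i \in I} V_i \bigl(g_0(X_i) - \hat g_0(X_i)\bigr).
\end{aligned}
```

Conditional on the auxiliary sample, \hat g_0 is a fixed function, so each summand in the displayed part of c^* has mean zero because \mathbb{E}[V \mid X] = 0. The conditional variance of the normalized sum equals \mathbb{E}\bigl[V^2 (\hat g_0(X) - g_0(X))^2\bigr], which tends to zero when \hat g_0 is consistent in mean square and the variance of V given X is bounded. Chebyshev's inequality then gives convergence in probability to zero.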
While sample splitting allows for the handling of residual terms, its direct application has the disadvantage that the estimator of the parameter of interest only uses the main sample, which may result in a significant loss of efficiency because we are only using a part of the available data.
It may, however, be possible to reverse the roles of the main and auxiliary samples in order to generate a second version of the estimator of the parameter of interest. We can restore full efficiency by averaging the two resulting estimators. Because the two estimators will be roughly independent, simply averaging them is an efficient technique. This sample splitting procedure, in which the roles of the main and auxiliary samples are swapped to obtain multiple estimates that are then averaged, is called cross-fitting.
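Cross-fitting with two folds can be sketched as below. This is a hand-rolled illustration under assumptions that are not in the article: a simulated data set with a true effect of 1, and a lightly regularized one-dimensional linear fit (`fit_linear`, a hypothetical helper) standing in for an ML learner.

```python
import numpy as np


def fit_linear(x, target, lam=1.0):
    """Lightly regularized slope of target on x (stand-in for an ML learner)."""
    return np.sum(x * target) / (np.sum(x ** 2) + lam)


def dml_one_split(X, D, Y, aux, main):
    """Fit nuisances on `aux`, then estimate theta on `main` by partialling out."""
    l_slope = fit_linear(X[aux], Y[aux])   # conditional mean of Y given X
    m_slope = fit_linear(X[aux], D[aux])   # conditional mean of D given X
    y_res = Y[main] - l_slope * X[main]
    d_res = D[main] - m_slope * X[main]
    return np.sum(d_res * y_res) / np.sum(d_res ** 2)


rng = np.random.default_rng(42)
n, theta0 = 4000, 1.0

# Hypothetical data-generating process (an assumption for illustration).
X = rng.normal(size=n)
D = X + rng.normal(size=n)
Y = theta0 * D + 2.0 * X + rng.normal(size=n)

idx = rng.permutation(n)
half_a, half_b = idx[: n // 2], idx[n // 2:]

# Estimate once in each direction, then swap the roles of the two halves.
theta_ab = dml_one_split(X, D, Y, aux=half_a, main=half_b)
theta_ba = dml_one_split(X, D, Y, aux=half_b, main=half_a)

# Cross-fitted estimate: average of the two split-specific estimates.
theta_cf = 0.5 * (theta_ab + theta_ba)
print(f"split estimates: {theta_ab:.3f}, {theta_ba:.3f}; cross-fitted: {theta_cf:.3f}")
```

Every observation is used once for fitting the nuisance functions and once for estimating the effect, so no data is wasted. The `n_folds = 3` argument in the DoubleML calls later in this article requests the three-fold version of the same idea.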
Implementing the double/debiased ML with DoubleML
This article uses regression data, so we will compare the results of DoubleML with a decision tree regressor, a random forest regressor, and an XGBoost regressor as learners.
The data set used for this article is the life expectancy data released by WHO, covering 2000 to 2015. It has a total of 22 columns, including its usual target variable, “Expectancy”, which records the expected age of the population of each country alongside different health care features.
DoubleML is a Python and R package that implements the double/debiased machine learning framework. The Python package is built on scikit-learn, whereas the R package is built on mlr3 and the mlr3 ecosystem. It simplifies the use of the double/debiased algorithm.
Installing Double ML
! pip install DoubleML
Import necessary libraries
import numpy as np
import pandas as pd
import doubleml as dml
from xgboost import XGBClassifier, XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import seaborn as sns
Reading and preprocessing the data
data = pd.read_csv('Life Expectancy Data.csv')
data.head()
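from sklearn.preprocessing import LabelEncoder

# Drop rows with missing values; copy so the new column is added to an independent frame
data_utils = data.dropna(axis=0).copy()

# Encode the categorical 'Status' column as integers
encoder = LabelEncoder()
data_utils['status_enc'] = encoder.fit_transform(data_utils['Status'])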
Initialize data backend
Let’s use the DoubleML package to estimate the average effect of life expectancy (‘Life expectancy ’), the treatment, on the adult mortality rate (‘Adult Mortality’), the outcome.
features_base = ['infant deaths', 'Alcohol', 'percentage expenditure',
                 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ',
                 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS',
                 'GDP', 'Population', ' thinness 1-19 years',
                 ' thinness 5-9 years', 'Income composition of resources',
                 'Schooling', 'status_enc', 'Year']

data_dml_base = dml.DoubleMLData(data_utils,
                                 y_col='Adult Mortality',
                                 d_cols='Life expectancy ',
                                 x_cols=features_base)
‘DoubleMLData’ takes the data frame, the outcome column (‘y_col’, here ‘Adult Mortality’), the treatment variable (‘d_cols’, here ‘Life expectancy ’), and the list of features to be used as independent variables (‘x_cols’).
Building Double ML model
This article only needs the partially linear regression model, which is appropriate here because the treatment variable is continuous.
DoubleML provides four models: partially linear regression models (PLR), partially linear IV regression models (PLIV), interactive regression models (IRM), and interactive IV regression models (IIVM).
trees = DecisionTreeRegressor(
    max_depth=30, ccp_alpha=0.0047,
    min_samples_split=203, min_samples_leaf=67)

np.random.seed(123)  # fix the random sample splits
# ml_l learns the outcome given the features, ml_m learns the treatment given the features
dml_plr_tree = dml.DoubleMLPLR(data_dml_base,
                               ml_l=trees,
                               ml_m=trees,
                               n_folds=3)
dml_plr_tree.fit(store_predictions=True)
tree_summary_plr = dml_plr_tree.summary
tree_summary_plr
This PLR model uses a decision tree regressor. Similarly, two more models are built, one with a random forest regressor and the other with an XGBoost regressor.
randomForest = RandomForestRegressor(
    n_estimators=500, max_depth=7,
    max_features=3, min_samples_leaf=3)

np.random.seed(123)
dml_plr_forest = dml.DoubleMLPLR(data_dml_base,
                                 ml_l=randomForest,
                                 ml_m=randomForest,
                                 n_folds=3)
dml_plr_forest.fit(store_predictions=True)
forest_summary_plr = dml_plr_forest.summary
forest_summary_plr
boost = XGBRegressor(n_jobs=1, objective="reg:squarederror",
                     eta=0.1, n_estimators=35)

np.random.seed(123)
dml_plr_boost = dml.DoubleMLPLR(data_dml_base,
                                ml_l=boost,
                                ml_m=boost,
                                n_folds=3)
dml_plr_boost.fit(store_predictions=True)
boost_summary_plr = dml_plr_boost.summary
boost_summary_plr
Comparing the results
final_summary = pd.concat((forest_summary_plr,
                           tree_summary_plr,
                           boost_summary_plr))
final_summary.index = ['forest', 'tree', 'xgboost']
final_summary[['coef', '2.5 %', '97.5 %']]
Here we can observe that the XGBoost regressor performs well compared to the other two. Overall, all three models point to a clear effect of life expectancy on adult mortality.
Double/debiased machine learning combines an orthogonalized formulation with sample splitting to produce valid statistical inference for the parameter of interest. In this article, we have covered how double/debiased machine learning works and how to implement it with DoubleML.