# A Primer to Ensemble Learning – Bagging and Boosting

Ensemble learning is a machine learning approach in which multiple models, often built with the same learning algorithm, are trained and their predictions combined. Bagging decreases the variance of the prediction by generating additional training sets from the original dataset through sampling with replacement (combinations with repetitions), which produces multisets of the original data. Boosting is an iterative technique that adjusts the weight of each observation based on the previous classification: if an observation was classified incorrectly, its weight is increased. In general, boosting builds strong predictive models.

# Ensemble Methods

### Definition and Objective of Analysis

Definition: Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.

Objective of Analysis: Minimize risk and maximize profit for the bank. To minimize loss from the bank’s perspective, the bank needs a decision rule for loan approval. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on his/her loan application. The dataset contains data on 13 variables and the classification of whether an applicant is considered credit-worthy (1) or non-credit-worthy (0) for 1,000 loan applicants. A predictive model developed on this data should guide the bank manager in deciding whether to approve a loan to a prospective applicant based on his/her profile.

Cost-Profit Consideration: The statistical decisions must be translated into profit consideration for the bank. Let us assume:

• Correct decision: A correct decision by the bank manager results in a 35% profit at the end of 5 years. A correct decision here means that the bank manager predicts an application to be good (credit-worthy) and it actually turns out to be credit-worthy.
• Wrong decision: When the bank manager predicts the application to be good but it turns out to be bad credit, the loss is 100%.
• No loan: If the bank manager predicts an application to be non-credit-worthy, the loan is not extended to that applicant and the bank does not incur any loss.

The cost matrix, therefore, is as follows (profit as a fraction of the loan amount):

| | Predicted credit-worthy | Predicted non-credit-worthy |
|---|---|---|
| Actually credit-worthy | +0.35 | 0 |
| Actually non-credit-worthy | -1.00 | 0 |

### Dataset Source and Contents

The following is a description of the dataset:

• Number of classes: 2 (‘Credit Worthy’ and ‘Non-Credit Worthy’)
• Number of attributes (columns): 13
• Number of instances (rows): 1,000

This data was extracted from the census bureau database found at:

http://archive.ics.uci.edu/ml/datasets

# Bagging and Boosting

### Bagging

| Aspect | Bagging |
|---|---|
| Partitioning of data | Random |
| Goal to achieve | Minimum variance |
| Methods used | Random subspace |
| Functions to combine single model | Weighted average |
| Example | Random Forest |

Definition:

Bagging is used when the goal is to reduce the variance of a decision tree classifier. The objective is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is used to train its own decision tree, so we get an ensemble of different models. The average of the predictions from the different trees is used, which is more robust than a single decision tree classifier.

Bagging Steps:

• Suppose there are N observations and M features in the training data set. A sample is drawn from the training data set randomly, with replacement.
• A subset of the M features is selected randomly, and whichever of these features gives the best split is used to split the node. This is repeated at each node.
• The tree is grown to its full size.
• The above steps are repeated n times, and the prediction is the aggregation of the predictions from the n trees.
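
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's Random Forest: to keep it short it uses one-split decision stumps instead of fully grown trees, picks the random feature subset once per stump rather than at every node, and aggregates by majority vote. The function names (`fit_stump`, `bagging_fit`, `bagging_predict`) are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Find the one-split rule with the fewest training errors (labels are 0/1)."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for polarity in (0, 1):
                # Rule: predict `polarity` when the value exceeds t, else the other class.
                errors = sum(
                    (polarity if row[f] > t else 1 - polarity) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    _, f, t, polarity = best
    return lambda row: polarity if row[f] > t else 1 - polarity

def bagging_fit(X, y, n_trees=25, n_features=1, seed=101):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # N rows drawn with replacement
        feats = rng.sample(range(m), n_features)     # random subset of the M features
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return models

def bagging_predict(models, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]
```

Because every model sees a different bootstrap sample, individual stumps place their split slightly differently; the vote smooths out those differences, which is the variance reduction bagging aims for.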

Advantages:

• Reduces over-fitting of the model.
• Handles higher dimensionality data very well.
• Maintains accuracy for missing data.

Disadvantages:

• Since the final prediction is based on the mean of the predictions from the individual trees, it does not give precise values for classification and regression models.

Python Syntax:

• from sklearn.ensemble import RandomForestClassifier
• rfm = RandomForestClassifier(n_estimators=80, oob_score=True, n_jobs=-1, random_state=101, max_features=0.50, min_samples_leaf=5)
• rfm.fit(x_train, y_train)
• predicted = rfm.predict_proba(x_test)

### Boosting

| Aspect | Boosting |
|---|---|
| Partitioning of data | Higher vote to misclassified samples |
| Goal to achieve | Increase accuracy |
| Methods used | Gradient descent |
| Functions to combine single model | Weighted majority vote |
| Example | Ada Boost |

Definition:

Boosting is used to create a collection of predictors. In this technique, learners are trained sequentially: early learners fit simple models to the data, and the data is then analysed for errors. Consecutive trees (each on a random sample) are fit, and at every step the goal is to improve on the accuracy of the prior tree. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better-performing model.

Boosting Steps:

• Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
• Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner C2.
• Find the training samples d3 in the training set D on which C1 and C2 disagree to train a third weak learner C3.
• Combine all the weak learners via majority voting.
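
The reweighting idea from the definition (raise the weight of misclassified inputs, then combine learners by a weighted vote) can be sketched as follows. This is a minimal AdaBoost-style illustration in plain Python, assuming labels coded as -1/+1 and decision stumps as weak learners. It is not the three-learner sampling procedure listed above, nor the XGBoost model used in this article, and the function names are hypothetical.

```python
import math

def stump_predict(row, f, t, sign):
    """Predict `sign` when feature f exceeds t, otherwise `-sign`."""
    return sign if row[f] > t else -sign

def fit_weighted_stump(X, y, w):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, f, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def adaboost_fit(X, y, n_rounds=10):
    n = len(X)
    w = [1.0 / n] * n                              # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, f, t, sign = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # vote strength of this learner
        ensemble.append((alpha, f, t, sign))
        # Misclassified rows are scaled by exp(+alpha), correct rows by exp(-alpha).
        w = [wi * math.exp(-alpha * yi * stump_predict(row, f, t, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]               # renormalise to sum to 1
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted majority vote of all weak learners."""
    score = sum(alpha * stump_predict(row, f, t, sign)
                for alpha, f, t, sign in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy problem where the positive class is an interval (for example x in 3..6 out of 0..9), no single stump separates the classes, but three reweighted stumps combined by the weighted vote do.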

Advantages:

• Supports different loss functions (we have used ‘binary:logistic’ for this example).
• Works well with interactions.

Disadvantages:

• Prone to over-fitting.
• Requires careful tuning of different hyper-parameters.

Python Syntax:

• from xgboost import XGBClassifier
• xgb = XGBClassifier(objective='binary:logistic', n_estimators=70, seed=101)
• xgb.fit(x_train, y_train)
• predicted = xgb.predict_proba(x_test)

# Conclusion

### Comparison Matrix

• Accuracy: (True Positive + True Negative) / Total Population. Accuracy is the ratio of correctly predicted observations to total observations, and is the most intuitive performance measure.
• True Positive: The number of correct predictions that the occurrence is positive.
• True Negative: The number of correct predictions that the occurrence is negative.
• F1-Score: (2 x Precision x Recall) / (Precision + Recall). F1-Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. F1-Score is usually more useful than accuracy, especially if you have an uneven class distribution.
• Precision: When a positive value is predicted, how often is the prediction correct?
• Recall: When the actual value is positive, how often is the prediction correct?

| Classification Algorithm | Accuracy | F1-Score |
|---|---|---|
| Decision Tree | 76.67 % | 0.74 |
| Bagging | 77.67 % | 0.77 |
| Boosting | 78.67 % | 0.78 |
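
The metric definitions above translate directly into code. The sketch below computes them from the four confusion-matrix counts; the function name `classification_metrics` and the counts in the usage line are hypothetical and are not taken from the article's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of all predicted positives, the share that is actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all actual positives, the share that was predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these illustrative counts the accuracy is 0.70 while the F1-Score is about 0.73, which shows that the two measures need not move together.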

Code location: https://github.com/f2005636/Ensemble

### Cost-Profit Consideration

Although Bagging and Boosting improve Accuracy and F1-Score only slightly over the Decision Tree, the improvement in overall profit is substantial: profit rises from 12.60 for the Decision Tree to 22.75 for Bagging and 25.10 for Boosting.

NCW (0): Non-Credit Worthy

CW (1): Credit Worthy

| Classification Algorithm | Calculation | Profit |
|---|---|---|
| Decision Tree | 0.35 * 196 - 1.00 * 56 | 12.60 |
| Bagging | 0.35 * 185 - 1.00 * 42 | 22.75 |
| Boosting | 0.35 * 186 - 1.00 * 40 | 25.10 |
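
The profit figures can be reproduced with a short calculation. The sketch below assumes every loan has the same size, so profit is expressed in units of one loan amount, and it reads the two numbers in each calculation as the count of approved applicants who turned out credit-worthy and the count who did not. The names `total_profit` and `results` are hypothetical.

```python
# Payoffs per approved loan, from the cost-profit assumptions.
GAIN_PER_GOOD_LOAN = 0.35   # approved and actually credit-worthy
LOSS_PER_BAD_LOAN = 1.00    # approved but actually non-credit-worthy

def total_profit(approved_good, approved_bad):
    """Rejected applications contribute nothing, so only approvals count."""
    return GAIN_PER_GOOD_LOAN * approved_good - LOSS_PER_BAD_LOAN * approved_bad

# (approved and credit-worthy, approved and non-credit-worthy) per model.
results = {
    "Decision Tree": (196, 56),
    "Bagging": (185, 42),
    "Boosting": (186, 40),
}

profits = {name: round(total_profit(*counts), 2) for name, counts in results.items()}
for name, profit in profits.items():
    print(f"{name}: {profit:.2f}")
```

The Decision Tree approves the most good loans, but each bad approval costs almost three good ones, so the models that approve fewer bad loans come out ahead on profit.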

Rohit Garg has close to 7 years of work experience in field of data analytics and machine learning. He has worked extensively in the areas of predictive modeling, time series analysis and segmentation techniques. Rohit holds BE from BITS Pilani and PGDM from IIM Raipur.
