
What is the Out-of-bag (OOB) score of bagging models?

The OOB score measures prediction performance on the observations left out of each bootstrap sample

Bagging, also known as bootstrap aggregation, is an ensemble learning strategy that helps machine learning algorithms improve their accuracy. It addresses the bias-variance trade-off by reducing a prediction model’s variance. Out-of-bag (OOB) observations are those not included in a given bootstrap sample or subsample. They are used to estimate the prediction error of the bagging algorithm, yielding the so-called OOB error. This article focuses on understanding the OOB error/score in a bagging algorithm. The following topics are covered.

Table of contents

  1. What is an OOB error?
  2. How does OOB error work?
  3. Bagging model with OOB score

The OOB error is frequently cited as an unbiased approximation of the genuine error rate. Let’s start by talking about OOB errors.



What is an OOB error?

Multiple trees are built on bootstrap samples, and their predictions are averaged (or combined by majority vote for classification). When each tree also considers only a random subset of the variables, the ensemble is known as a random forest, and it often outperforms a single tree. Because every bootstrap sample is drawn with replacement, some records are left out of it. The prediction error measured on these left-out records is known as the OOB error, and the corresponding accuracy is the OOB score. It can be used to fine-tune the model’s parameters.

With classification and regression trees, for example, tree depth is crucial: how far should the tree grow? A tree grown to its full depth is likely to overfit the training data, which increases the error when predicting new data. In each bootstrap cycle, the OOB score for trees of varying depths may be computed, and the minimum-error depth is recorded.
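As a rough illustration of depth tuning with the OOB estimate, the sketch below fits one forest per candidate depth and keeps the depth with the lowest OOB error. It works at the forest level rather than per bootstrap cycle, and the synthetic dataset, the candidate depths, and the variable names are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6, random_state=1)

oob_error_by_depth = {}
for depth in (1, 2, 4, 8, None):  # None lets every tree grow to full depth
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=depth, oob_score=True, random_state=1
    )
    forest.fit(X, y)
    # oob_score_ is an accuracy for classifiers, so the error is its complement.
    oob_error_by_depth[depth] = 1.0 - forest.oob_score_

# Keep the depth with the smallest OOB error.
best_depth = min(oob_error_by_depth, key=oob_error_by_depth.get)
print(oob_error_by_depth)
print("depth with the lowest OOB error:", best_depth)
```

No separate validation set is needed here, because each record is scored only by trees that never saw it during training.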

When to use

For any given sample, only a subset of the decision trees (those that did not see the sample during training) is used to determine its OOB prediction. This reduces the total aggregation effect of bagging. In general, therefore, validating with the full ensemble of decision trees gives a better estimate of the score than validating with a subset. However, the dataset is sometimes too small to set part of it aside for validation. In cases where a large dataset is not available and all of it should be used for training, the OOB score provides a good trade-off. Nonetheless, the validation score and the OOB score are computed differently and should not be compared directly.
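The two options can be sketched side by side. This is only an illustration of how each number is obtained, on a small synthetic dataset chosen for this example; because the two scores are computed differently, the sketch is not meant to rank one against the other.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset, used only for illustration.
X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=7)

# Option A: hold out 30% for validation, so the forest trains on only 70% of the rows.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=7)
holdout_forest = RandomForestClassifier(n_estimators=200, random_state=7)
holdout_forest.fit(X_train, y_train)
validation_accuracy = holdout_forest.score(X_val, y_val)  # scored by all 200 trees

# Option B: train on every row and read the OOB score instead.
oob_forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=7)
oob_forest.fit(X, y)
oob_accuracy = oob_forest.oob_score_  # each row scored only by trees that did not train on it

print(f"validation accuracy (70% train): {validation_accuracy:.3f}")
print(f"OOB accuracy (100% train):       {oob_accuracy:.3f}")
```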


How does OOB error work?

When bootstrap aggregation is used, two separate sets are produced. The data chosen to be “in-the-bag” by sampling with replacement is one set, the bootstrap sample. The out-of-bag set contains all data that was not picked during the sampling procedure.
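A minimal sketch of this split, using NumPy on row indices only (the sample size and the seed are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Bootstrap sample: draw n_samples indices with replacement ("in-the-bag").
in_bag = rng.choice(indices, size=n_samples, replace=True)

# Out-of-bag set: every index that was never drawn.
oob = np.setdiff1d(indices, in_bag)

oob_fraction = len(oob) / n_samples
print(f"unique in-bag rows: {len(np.unique(in_bag))}")
print(f"out-of-bag rows:    {len(oob)} ({oob_fraction:.1%})")
```

For a bootstrap sample of the same size as the original set, the chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e (about 36.8%) as n grows, so roughly a third of the rows end up out-of-bag for each tree.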

When this procedure is repeated, such as when developing a random forest, numerous bootstrap samples and OOB sets are generated. The OOB sets can be combined into a single dataset, but each sample is only considered out-of-bag for the trees that do not include it in their bootstrap sample. The diagram below demonstrates that the data for each bag collected is divided into two categories.

Because each out-of-bag set is not used to train the corresponding model, it is a useful test of that model’s performance. The exact computation of the OOB error depends on the model’s implementation, but a generic calculation is as follows.

  • Identify all models (or trees, in the case of a random forest) that were not trained on the OOB instance.
  • Take the majority vote of the outcomes of these models for the OOB instance, and compare it to the real value of the OOB instance.
  • Compile the OOB error for all OOB dataset instances.
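The three steps above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions (synthetic binary data, 50 trees, integer class labels starting at 0), not the computation used by any particular library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
n_samples, n_trees = len(X), 50
n_classes = len(np.unique(y))

# votes[i, c] counts how many trees that did NOT train on row i predicted class c.
votes = np.zeros((n_samples, n_classes), dtype=int)

for _ in range(n_trees):
    in_bag = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False                              # rows this tree never saw
    tree = DecisionTreeClassifier(random_state=0).fit(X[in_bag], y[in_bag])
    # Step 1: only this tree's out-of-bag rows receive a vote from it.
    oob_pred = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), oob_pred] += 1

# Step 2: majority vote for rows that were out-of-bag at least once.
has_vote = votes.sum(axis=1) > 0
majority = votes[has_vote].argmax(axis=1)

# Step 3: aggregate the comparisons into a single OOB error rate.
oob_error = np.mean(majority != y[has_vote])
print(f"rows with at least one OOB vote: {has_vote.sum()} of {n_samples}")
print(f"OOB error: {oob_error:.3f}")
```

For regression, the majority vote in step 2 would be replaced by the mean of the out-of-bag predictions.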

The bagging process may be tailored to a model’s specifications. The bootstrap training sample size should be close to that of the original set to achieve an accurate model. The number of iterations (trees) of the model (forest) should also be considered when determining the true OOB error. Because the OOB error settles only after many iterations, it is best to begin with a large number of iterations.
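One way to watch the OOB error settle is to grow a single forest in stages and record the error after each stage. The sketch below uses scikit-learn's `warm_start=True` so that existing trees are kept and only new ones are added; the dataset and the tree counts are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# warm_start=True reuses the trees from the previous fit and adds new ones.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_errors = {}
for n_trees in (25, 50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_errors[n_trees] = 1.0 - forest.oob_score_

for n_trees, error in oob_errors.items():
    print(f"{n_trees:>4} trees: OOB error = {error:.3f}")
```

With very few trees, some rows may never be out-of-bag, which makes the estimate unreliable; that is one more reason to start with a generous number of trees.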

Bagging model with OOB score

This article uses a random forest as the bagging model, specifically the random forest classifier. The dataset relates to health and fitness: it contains parameters recorded by an Apple Watch and a Fitbit, and the goal is to classify activities from those parameters.

Let’s start with reading and preprocessing the data.

data.drop(['Unnamed: 0','X1'],axis=1,inplace=True)
data_aw=data[data['device']=='apple watch']

The data was collected by two different devices, an Apple Watch and a Fitbit, so it is separated by device. The data also has a categorical variable that needs to be encoded before the model is trained.

from sklearn.preprocessing import LabelEncoder
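As a small illustration of what the encoder does, the sketch below applies `LabelEncoder` to a made-up list of activity labels. The label values are hypothetical and are not taken from the article's dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical activity labels, for illustration only.
activities = ["Running", "Lying", "Sitting", "Running"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(activities)  # classes are sorted, then numbered from 0

classes = list(encoder.classes_)
decoded = list(encoder.inverse_transform(encoded))
print(classes)           # ['Lying', 'Running', 'Sitting']
print(encoded.tolist())  # [1, 0, 2, 1]
```

The fitted encoder keeps the mapping, so `inverse_transform` can turn predicted integers back into the original labels.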

The data is split into training and test sets in a 70:30 ratio.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

Building the model

rfc_best=RandomForestClassifier(random_state=42,oob_score=True,       criterion='entropy',max_depth=8,max_features='sqrt',

The best parameters for the random forest are found with a randomized search with cross-validation, and by turning on ‘oob_score’ we can retrieve the OOB score of the model on the training dataset. This score gives an idea of the model’s accuracy before other metrics such as precision and recall are used.
Analytics India Magazine

To understand the effect of tuning the model, compare the tuned model’s OOB score with the baseline model’s OOB score.
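A minimal sketch of such a comparison is given below. It runs on synthetic data rather than the article's dataset, and the search space, `n_iter`, and `cv` values are assumptions chosen to keep the example fast; the result says nothing about the scores reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data, used only for illustration.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=3, random_state=42
)

# Baseline: default hyperparameters, OOB scoring switched on.
baseline = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
baseline.fit(X, y)

# Hypothetical search space for the randomized search.
param_distributions = {
    "max_depth": [2, 4, 8, None],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X, y)

# Refit with the best parameters found and read the OOB score of the tuned model.
tuned = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=42, **search.best_params_
)
tuned.fit(X, y)

print("best parameters:", search.best_params_)
print(f"baseline OOB score: {baseline.oob_score_:.3f}")
print(f"tuned OOB score:    {tuned.oob_score_:.3f}")
```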


We can observe that there is a large difference between the OOB scores of the tuned and baseline models.

Let’s take a closer look at the random forest model by using different metrics to evaluate its performance on unseen data.

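import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

prediction = rfc_best.predict(X_test)  # assumed: test-set predictions from the fitted model
print('Recall score',np.round(recall_score(y_test,prediction,average='weighted'),3))
print('Precision score',np.round(precision_score(y_test,prediction,average='weighted'),3))
print('Area under the ROC',np.round(roc_auc_score(y_test,rfc_best.predict_proba(X_test),average='weighted',multi_class='ovr'),3))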

The recall and precision scores are almost identical at about 0.72, which also matches the oob_score of the model. Together with an area under the ROC curve of 0.93, this suggests that the model has done reasonably well in predicting the labels.


The out-of-bag (OOB) error is a way of calculating the prediction error of machine learning models that use bootstrap aggregation (bagging), including random forests and some boosted decision trees. However, the OOB error can be a biased estimate of the true error. In this article, we have looked at the OOB error and how to interpret it using a random forest.



Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
