Random forest, or random decision forest, is a tree-based ensemble learning method for classification and regression. It is applied in fields such as banking and e-commerce to support decision making and to predict behavior and outcomes.
The basic idea behind the random forest algorithm is to combine many classifiers into one model. A random forest consists of many decision trees, and the trees are built with a technique called bagging, or bootstrap aggregating. Bagging is an ensemble meta-algorithm that improves the stability and accuracy of machine learning algorithms.
From the above paragraph, we can see that the performance of a random forest depends on how its decision trees are trained. The prediction of a random forest is obtained by combining the predictions of all its trees: the average (mean) for regression, or the majority vote for classification.
Increasing the number of trees in the forest generally improves the accuracy of the whole model, although the gains level off after a certain point. Although the random forest is built on decision trees, it removes some of their limitations; in particular, it reduces the overfitting problem. scikit-learn provides an implementation of the random forest algorithm that needs very little configuration.
How Does the Random Forest Algorithm Work?
Before looking at how the random forest works, we need to understand the decision tree, because decision trees are the building blocks of a random forest. As the name suggests, a decision tree is an algorithm that forms a tree-like structure.
Three main components make up this structure:
- Decision node.
- Leaf node.
- Root node.
When a decision tree algorithm works on the data, it divides the data into branches, and each branch is split further into sub-branches. This splitting continues until a leaf node is reached; a leaf node cannot be split further.
The image below can represent the block structure of the decision tree.
In the image, the nodes represent the features of the data that are used to make predictions, and the arrows represent the branches that link the decision nodes to the leaf nodes. A leaf node holds the result, which is derived from the samples of the data that reach it.
Entropy and information gain are the basic quantities used to build a decision tree. Both come from information theory, and a basic understanding of them helps in understanding the decision tree.
Entropy measures uncertainty, whereas information gain measures how much uncertainty about the target variable is removed.
Basically, information gain uses an independent variable to gain information about the target variable. To estimate the information gain, the algorithm calculates the entropy of the target variable and the conditional entropy of the target variable given the independent variable. The conditional entropy is then subtracted from the entropy of the target variable.
Entropy and information gain are important for deciding how the branches of the decision tree are split.
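To make the two quantities concrete, here is a minimal sketch in plain Python. The function names and the tiny purchase dataset are made up for illustration; they are not taken from any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the target minus its conditional entropy given the feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        # Target labels of the rows where the feature takes this value
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy data: did the shopper buy, given two yes/no features?
bought = ["buy", "buy", "skip", "skip"]
price_is_low = ["yes", "yes", "no", "no"]    # separates the classes perfectly
on_top_shelf = ["yes", "no", "yes", "no"]    # tells us nothing about the target

print(entropy(bought))
print(information_gain(price_is_low, bought))
print(information_gain(on_top_shelf, bought))
```

A feature that separates the classes perfectly removes all of the uncertainty, while a feature that is unrelated to the target removes none, which is why a tree prefers to split on the former.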
Let's take the example of a product in a supermarket, where a user purchases an item depending on its features and usability. In a decision tree, the features and the usability correspond to the root node and the decision nodes, and the final decision, whether the user purchases the item or not, corresponds to the leaf node. The decision varies according to the price, usability and expiration date. The decision tree for this example is shown below.
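The supermarket example can also be sketched in code. The eight shoppers below are invented purely for illustration, and the feature names are my own; the sketch only shows how scikit-learn grows a tree on such data and prints it as text.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [price, usability score (0-10), days until expiry]
X = [
    [5, 8, 30], [6, 7, 20], [4, 9, 15], [5, 6, 25],
    [20, 8, 30], [25, 9, 10], [5, 2, 30], [6, 8, 1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = purchased, 0 = not purchased

# criterion="entropy" makes the tree split by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Text rendering: the first rule is the root node, indented rules are
# decision nodes, and the "class:" lines are the leaf nodes
print(export_text(tree, feature_names=["price", "usability", "days_to_expiry"]))
```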
As discussed, a random forest consists of many decision trees, and it uses bagging (the bootstrap method) to generate its results. Bagging is an ensemble meta-algorithm that draws random samples from the data. Each sample is sent to a different decision tree, and the trees produce different outputs. The output with the most votes is selected as the final output.
The random forest algorithm supplies data to its trees by bootstrap aggregating, or bagging. As the name suggests, bagging resamples the data into "bags" that are used to train the tree learners. Given a training set with independent variables X and a dependent variable Y, bagging repeatedly selects a random sample and fits a tree to it:
For b = 1, …, B:
- Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
- Train a classification or regression tree fb on Xb, Yb.
After training, predictions for an unseen sample x’ can be made by averaging the predictions from all the individual regression trees on x’:
\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')
or by taking the majority vote in the case of classification trees.
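The bagging loop above can be sketched by hand with scikit-learn decision trees. This is an illustration of the procedure, not how scikit-learn implements its forest internally; B = 25 and the seeds are arbitrary choices of mine.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
B = 25  # number of bootstrap rounds (odd, so a 0/1 vote cannot tie)

trees = []
for b in range(B):
    # Sample n training examples with replacement: X_b, Y_b
    idx = rng.integers(0, n, size=n)
    # Train a classification tree f_b on X_b, Y_b
    trees.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

# Majority vote: the labels are 0/1, so a mean vote above 0.5 means class 1
votes = np.stack([t.predict(X) for t in trees])  # shape (B, n)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged vote:", (bagged_pred == y).mean())
```

The accuracy printed here is measured on the training data, so it is optimistic; it is only meant to show the vote being computed.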
Because bagging decreases the variance of the model without increasing the bias, it improves the performance of the model. A single decision tree is very sensitive to noise in its training set, whereas the average of many trees is not, as long as the trees are not correlated.
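One way to see this effect is to compare a single tree with a forest under cross-validation. The sketch below uses the breast cancer dataset bundled with scikit-learn; the fold count and the seeds are arbitrary, and the exact numbers will vary with the library version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy of one tree and of a forest of 100 trees
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

# The spread across folds gives a rough feel for how stable each model is
print("single tree:", tree_scores.mean(), "+/-", tree_scores.std())
print("forest:     ", forest_scores.mean(), "+/-", forest_scores.std())
```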
Most random forests also follow a procedure called feature bagging: at each candidate split, the learning algorithm considers only a random subset of the features. The reason is that if a few features are very strong predictors of the target, they would be selected in most of the trees, and the results of the trees would become correlated with each other. Sampling the features keeps the trees less correlated.
If the dataset has p features, it is typically recommended to consider √p features at each split for a random forest classifier and p/3 features at each split for a random forest regressor.
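In scikit-learn, the number of features considered at each split is set with the max_features parameter, which the text above does not mention by name. The sketch below shows one way to request the two recommendations; rounding down is an assumption on my part.

```python
import math

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

p = 30  # example feature count (the breast cancer dataset has 30 features)

# Rounded down, the two rules of thumb give:
print("classification:", math.isqrt(p), "features per split")
print("regression:    ", p // 3, "features per split")

# "sqrt" asks for sqrt(p) features per split; a float is read as a fraction of p
clf = RandomForestClassifier(max_features="sqrt", random_state=0)
reg = RandomForestRegressor(max_features=1 / 3, random_state=0)
```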
In Python, scikit-learn provides implementations of the random forest classifier and regressor.
You can create the model with the following code.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
The model accepts the following parameters. I discuss only some important ones here; for more information about the parameters, see the scikit-learn documentation.
- n_estimators: the number of trees in the forest. It must be an integer.
- criterion: the function used to measure the quality of a split. You can use "gini" for the Gini impurity or "entropy" for the information gain.
- max_depth: the maximum depth of each tree, measured from the root node to the deepest leaf node. It must be an integer or None; with None (the default), the trees grow until the leaves are pure or too small to split.
- random_state: controls the randomness of the bagging procedure and of the feature sampling. Pass an integer to get reproducible results.
I have implemented a random forest classifier on the breast cancer Wisconsin dataset provided by scikit-learn, which has the following two classes:
And the following features.
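The class and feature listings can be reproduced with a short script. This is a sketch rather than the exact code behind the article: the train/test split, max_depth=4 and the seeds are my own choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
print(data.target_names)   # the two classes
print(data.feature_names)  # the names of the 30 features

# Hold out a quarter of the data for evaluation (split size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0, stratify=data.target
)

model = RandomForestClassifier(
    n_estimators=100,   # number of trees
    criterion="gini",   # split quality measure
    max_depth=4,        # limit on the depth of each tree
    random_state=0,     # fixed seed for reproducibility
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```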
The major concern with any random forest arises when we want to understand the model. After setting everything up for the data, we want to know how the model behaved in the background: how were the different trees trained, and if one tree performs better than the others, how can we extract it from the random forest?
In this situation, we can use export_graphviz from the sklearn.tree module to visualize the forest and its trees. The visualizations answer these questions. Some of the visualizations I have implemented are as follows.
Here we can see 4 trees from our random forest. The image is so large that it is not legible in the article; the reader can access this link for a better view of the graph.
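A figure like this can be produced along the following lines. The sketch trains a small forest of 4 shallow trees (my own settings, chosen to keep the output small) and exports each tree as Graphviz DOT source.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=4, max_depth=3, random_state=0)
rf.fit(data.data, data.target)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_sources = [
    export_graphviz(
        tree,
        out_file=None,
        feature_names=list(data.feature_names),
        class_names=list(data.target_names),
        filled=True,  # color the nodes by majority class
    )
    for tree in rf.estimators_
]
print(dot_sources[0][:200])
```

Each string can be saved to a .dot file and rendered with the Graphviz command-line tool, for example `dot -Tpng tree.dot -o tree.png`.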
We can also extract trees from the random forest. The result of tree extraction is as follows.
Here we can see the image of a single decision tree from the random forest.
If we need the best results from randomly selected samples of the dataset, this visualization shows which tree performs well and on which sample of the data.
We can then easily extract the tree using the following code.
rf = RandomForestClassifier()
# fit the forest first with rf.fit(...); estimators_ is only available afterwards
# first decision tree
rf.estimators_[0]
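A fuller, runnable sketch of the extraction follows. It also scores every tree on held-out data to find the best-performing one; the split, the forest size and the variable names are my own choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# estimators_ is the list of fitted decision trees inside the forest
first_tree = rf.estimators_[0]

# Accuracy of each individual tree on the held-out data
scores = [tree.score(X_val, y_val) for tree in rf.estimators_]
best_index = max(range(len(scores)), key=scores.__getitem__)
best_tree = rf.estimators_[best_index]
print("best tree:", best_index, "accuracy:", scores[best_index])
```

Picking a tree by its score on one held-out set is optimistic, so the chosen tree should be checked again on data that was not used for the selection.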
In this article, we have seen how a random forest combines decision trees with bootstrap aggregation, and by visualizing the trees we got to know the model better. In testing, it can be crucial to understand what happens in the background when a failure occurs. Algorithms whose internals cannot be inspected during processing are called black-box algorithms; in our case, the decision trees inside the forest were such a black box, which we opened up to see how it works. I encourage readers to apply these procedures to their real-life problems to get better acquainted with the random forest algorithm.