Decision trees are one of the classical supervised learning techniques, used for both classification and regression analysis. They are well suited to problems where the individual features deserve special consideration, because the algorithm constructs the tree using splitting criteria computed directly from the features of the training data. Information gain is one such criterion used to construct decision trees from the training features. In this article, we will develop an in-depth understanding of how information gain is used in a decision tree, with a real-life example. The major points covered in the article are listed below.

**Table of Contents**

- Decision Trees
- Information Gain
- What is Entropy?
- Steps to Split Decision Tree using Information Gain
- Entropy for Parent Node
- Entropy for Child Node
- Weighted Entropy Calculation
- Calculation of Information Gain

Before proceeding further, let us get a quick understanding of the decision tree so that we have a foundation for our main discussion.

**Decision Trees**

Before going deep into the main concept of the article, let us have a basic introduction to the decision tree. The decision tree algorithm is a supervised learning algorithm that can be used for both classification and regression analysis. Unlike linear algorithms, decision trees are capable of modelling nonlinear relationships between the variables in the data.

The above diagram represents the workflow of a basic decision tree, where a student needs to decide whether or not to go to school. In this example, the decision tree decides based on certain criteria. The rectangles in the diagram can be considered the nodes of the decision tree, and the splits on the nodes let the algorithm make a decision. In this example we have only two variables, so it is easy to see where and on which node to split. To perform the right split of the nodes on a dataset with many variables, information gain comes into the picture.

**Information Gain**

Information gain in a decision tree can be defined as the reduction in uncertainty about the class labels that is achieved by splitting a node for further decisions. To understand information gain, let's take an example of three nodes.

As we can see, these three nodes contain data from two classes. Node 3 contains instances of only one class, node 2 has fewer instances of the second class than the first, and node 1 is perfectly balanced. In node 3 we don't need to make any further decision, because every instance belongs to the first class, whereas in node 1 each class is equally likely, so we need more information than in the other nodes to describe a decision. In other words, node 1 is the most impure and requires the most information.

From the above, we can say that balanced, i.e. most impure, nodes require more information to describe. Let's take a look at the image below of two splits with different impurities.

Here we can see that the split on the right gives us heterogeneous nodes, whereas the split on the left gives us homogeneous nodes. As discussed above, the split on the left therefore has higher information gain than the other, and from this we can infer that an increase in information gain yields more homogeneous, or pure, nodes.

To measure information gain we use entropy, which is a quantified measure of the amount of uncertainty in a process or random variable. Information gain is the reduction in entropy achieved by a split:

Information Gain = Entropy(parent) – Weighted Entropy(children)

In the examples below, the parent node is perfectly balanced between two classes, so its entropy is 1 and the formula simplifies to Information Gain = 1 – Weighted Entropy.

**What is Entropy?**

As of now, we have been talking about information gain, which comes from the subject of information theory. In information theory, the entropy of a random variable or process is the average level of uncertainty in its possible outcomes. To understand it better, let's take the example of a coin flip, where there are only two possible outcomes: if the probability of a tail is p, then the probability of a head is 1 - p. The uncertainty is maximal at p = 1/2, when there is no reason to expect one outcome over the other; here the entropy is 1 bit. If the outcome is certain, i.e. p = 0 or p = 1, there is no uncertainty and the entropy is 0 bits. Mathematically, the formula for entropy is:

Entropy H(X) = -Σi p(Xi) * log2(p(Xi))

Where

X = random variable or process

Xi = possible outcomes

p(Xi) = probability of possible outcomes.

So, for any random process or variable, once we have calculated the probabilities of the possible outcomes and, from those probabilities, the entropy, we can easily calculate the information gain.
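As a quick sketch using only the Python standard library, the entropy formula can be written as a small helper. The function name and the 0 * log2(0) = 0 convention are our own choices for illustration, not part of the original example:

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: H = -sum(p * log2(p)) over the outcome probabilities."""
    h = 0.0
    for p in probabilities:
        if p > 0:  # skip p = 0 terms, using the convention 0 * log2(0) = 0
            h -= p * log2(p)
    return h

# A fair coin flip (p = 1/2 for each side) has maximum uncertainty:
print(entropy([0.5, 0.5]))  # 1.0
# A certain outcome has no uncertainty:
print(entropy([0.0, 1.0]))  # 0.0
```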

Let's take the example of a family of 10 members, where 5 members are pursuing their studies and the rest have either completed their studies or never pursued them.

% of pursuing = 50%

% of not pursuing = 50%

Let’s first calculate the entropy for the above-given situation.

Entropy = -(0.5) * log2(0.5) -(0.5) * log2(0.5) = 1

Following the above entropy formula, we have filled in the values: the probabilities of pursuing and not pursuing are both 0.5, and log2(0.5) = -1. Now let's assume a family of 10 members in which everyone has already completed graduation, so no one is currently pursuing studies.

% of pursuing = 0%

% of not pursuing = 100%

According to this, the entropy of the given situation will be

Entropy = -(0) * log2(0) -(1) * log2(1) = 0 (using the convention that 0 * log2(0) = 0)
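The two family scenarios can also be computed directly from class labels. This is a minimal sketch; the labels "P" (pursuing) and "N" (not pursuing) are just illustrative names:

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / total  # p > 0, since only labels that occur are counted
        h -= p * log2(p)
    return h

family_1 = ["P"] * 5 + ["N"] * 5  # 50% pursuing, 50% not -> entropy 1
family_2 = ["N"] * 10             # one class only (a pure node) -> entropy 0
print(entropy_of_labels(family_1))  # 1.0
print(entropy_of_labels(family_2))  # 0.0
```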

From the above, we can say that if a node contains only one class (formally, if the node is pure), the entropy of the data in that node is zero; by the information gain formula, the information gained by producing such a node is higher, and the purity is higher. Conversely, if the entropy is higher, the information gain is lower and the node is less pure.

One thing worth noting here is that information gain with entropy, as described here, works only with categorical class data.

In the above example, we have seen how to calculate the entropy of a single node. Now let's talk about a decision tree where the number of nodes is larger.

**Calculate Entropy Step by Step**

In a decision tree we have parent nodes and child nodes; the topmost parent node is also called the root node of the tree. We start by splitting the parent node, and after splitting, the weighted average entropy of the child nodes is the final entropy used for calculating the information gain.

**Entropy for Parent Node**

Here we start with an example of 20 students from class 11th and class 12th. The parent node can be split on the basis of the students' performance, and it can similarly be split on the basis of their class level, as in the image below.

Ten of the 20 students take part in the curricular activity and ten do not, so the entropy for the parent node is

Entropy = -(0.5) * log2(0.5) -(0.5) * log2(0.5) = 1

**Entropy for Child Node**

There is no difference in the formula of entropy for a child node; it is the same for every node, and we can simply put the values into the formula. In one child node, 57% of the students are performing the curricular activity and 43% are not:

Entropy = -(0.43) * log2(0.43) -(0.57) * log2(0.57) = 0.98

This gives the entropy of one child node. Let's do the same for the other child node, where 33% of the students are involved in the curricular activity:

Entropy = -(0.33) * log2(0.33) -(0.67) * log2(0.67) = 0.91

**Weighted Entropy Calculation**

Now that we have calculated the entropy for the parent and child nodes, the weighted sum of the child entropies gives the weighted entropy of the split.

Weighted Entropy : (14/20)*0.98 + (6/20)*0.91 = 0.959

By the weighted entropy we can say that the split on the basis of performance gives an entropy of around 0.96. This final entropy is lower than the entropy of the parent node, so we can say that the split produces purer child nodes (with fewer classes mixed within each node) than the parent node of the tree.
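The child entropies and the weighted entropy can be checked with a short Python sketch. The child sizes (14 and 6 of the 20 students) are read off the example split; the computed values differ slightly from the article's because the article rounds the child entropies to two decimals before weighting:

```python
from math import log2

def entropy(probs):
    # H = -sum p * log2(p), skipping zero-probability terms
    return sum(-p * log2(p) for p in probs if p > 0)

# Child entropies from the example's class proportions
h_child_1 = entropy([0.43, 0.57])  # ~0.986 (rounded to 0.98 above)
h_child_2 = entropy([0.33, 0.67])  # ~0.915 (rounded to 0.91 above)

# Weight each child by its share of the 20 students (14 and 6)
weighted = (14 / 20) * h_child_1 + (6 / 20) * h_child_2
print(weighted)  # ~0.965 (0.959 above, due to rounding)
```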

We can follow a similar procedure for the split based on class.

Entropy for the parent node:

Entropy = -(0.5) * log2(0.5) -(0.5) * log2(0.5) = 1

Entropy for child nodes

Class 11th

Entropy = -(0.8) * log2(0.8) -(0.2) * log2(0.2) = 0.722

Class 12th

Entropy = -(0.2) * log2(0.2) -(0.8) * log2(0.8) = 0.722

Weighted entropy

Weighted Entropy : (10/20)*0.722 + (10/20)*0.722 = 0.722

Again we can see that the weighted entropy for this split is less than the parent entropy. Using these entropies and the formula for information gain, we can calculate the information gain of each split.

**Calculation of Information Gain**

The formula for information gain based on entropy is

Information Gain = Entropy(parent) – Weighted Entropy(children)

Since the parent entropy in our example is 1, this reduces to Information Gain = 1 – Weighted Entropy. The table below shows the information gain values for the example:

| Split | Entropy | Information Gain |
| --- | --- | --- |
| Performance | 0.959 | 0.041 |
| Class | 0.722 | 0.278 |

As we have discussed in the earlier section of the article, an increase in information gain corresponds to a more homogeneous split of the node, i.e., the formation of purer nodes. Hence, in the above example, the split based on class gives us more homogeneous child nodes than the nodes produced by the split based on performance.
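The whole comparison can be reproduced in a few lines of Python. This is a sketch using the proportions from the worked example, not a general-purpose routine; the small differences from the table come from rounding in the intermediate steps:

```python
from math import log2

def entropy(probs):
    return sum(-p * log2(p) for p in probs if p > 0)

def weighted_entropy(children):
    # children: list of (n_samples, class_probabilities) for each child node
    total = sum(n for n, _ in children)
    return sum((n / total) * entropy(probs) for n, probs in children)

parent_entropy = entropy([0.5, 0.5])  # balanced parent node -> 1.0

# Split on performance: children of 14 and 6 students
ig_performance = parent_entropy - weighted_entropy(
    [(14, [0.43, 0.57]), (6, [0.33, 0.67])])

# Split on class: two children of 10 students each
ig_class = parent_entropy - weighted_entropy(
    [(10, [0.8, 0.2]), (10, [0.2, 0.8])])

print(ig_performance)  # ~0.035 (0.041 in the table, after rounding)
print(ig_class)        # ~0.278 -> the class split is the better one
```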

**Final Words**

In this article, we have seen how information gain can be used for constructing a decision tree. The information gain criterion for splitting nodes works with categorical class data and is based on the entropy of the split. It is a good criterion to use when working with decision trees, as we have seen that it takes uncertainty into account.