Implementing all decision tree algorithms with one framework – ChefBoost

ChefBoost is a Python package that provides functions for implementing all the regular types of decision trees as well as the advanced techniques built on them. What is noticeable about the package is that we can build any of these trees using just a few lines of code.

Different decision tree algorithms suit different use cases, and several of them can often be applied to the same problem: all we need to do is change the algorithm, fit the models and compare their performance. In this article, we cover an approach through which we can run all the common decision tree algorithms in the same framework quickly and compare their performance easily. We will use ChefBoost, a lightweight decision tree framework with which a decision tree can be implemented in just a few lines of code. The major points to be discussed in this article are listed below.

Table of contents

  1. Brief about decision trees
  2. Different decision tree algorithms
  3. About ChefBoost Framework
  4. Implementing all decision tree algorithms with ChefBoost

Let’s start with understanding decision trees in brief.

Brief about decision trees

Various algorithms can be used for regression and classification analysis, and the decision tree is one of them. As the name suggests, the algorithm reaches a decision by traversing a tree-like structure of conditions. Such trees are mainly useful when the data we are dealing with has nonlinear relationships. The picture below represents the working of a simple decision tree.


In the above picture, we can see that a student has to decide whether to go to school based on two conditions. For example, if he thinks he is not currently suffering from COVID, the next question is whether going to school could expose him to COVID, and that condition also needs to be checked before he goes. In this example, each block is a node, and splitting these nodes leads us to a decision. It is a very simple example, and we can easily say which decision the student should take under which condition.

What mainly affects the decisions taken by the algorithm is the size and nature of the data. To understand how the tree handles it, we need to know about quantities such as information gain, entropy and the number of splits; one of our articles covers most of this information about decision trees. We can also call this algorithm the root of the forest-based algorithms. The decision tree is not a single algorithm but has various versions.
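
To make the terms concrete, here is a minimal, self-contained sketch (independent of ChefBoost) of how entropy and information gain can be computed for a single categorical split. The tiny 'Outlook'/'Decision' table is made up purely for illustration.

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a label column: -sum(p * log2(p))
    probabilities = labels.value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())

def information_gain(df, feature, target):
    # entropy of the target minus the weighted entropy after splitting on `feature`
    total = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return total - weighted

# toy data, made up for illustration only
toy = pd.DataFrame({'Outlook': ['Sunny', 'Sunny', 'Rain', 'Rain'],
                    'Decision': ['No', 'No', 'Yes', 'No']})
print(information_gain(toy, 'Outlook', 'Decision'))  # about 0.31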



Different decision tree algorithms

All decision tree algorithms are based on splitting on the features that give the highest information about the target. As with most algorithms, each newer version improves on the older one in some respect. There are five popular versions of the decision tree algorithm:

  • ID3: This can be considered the baseline version of the decision tree. The name is an acronym of Iterative Dichotomiser 3, where 'to dichotomize' means dividing a condition into two opposite decisions. The whole procedure iteratively splits conditions in this way to construct the tree; at each step the algorithm calculates entropy and information gain, and the most dominant split decides the result.
  • C4.5: As discussed above, ID3 mainly finds the dominant decision among categorical conditions, but it offers nothing extra for continuous data. C4.5 was introduced to deal with this: using this version we can also handle continuous features and missing values.
  • CART: As practitioners of data science we have all heard of the Gini index in decision trees and random forests. This index was first introduced with the CART version of the decision tree algorithm, which uses it as the splitting criterion in place of information gain. The Gini index is computed by subtracting the sum of the squared class probabilities from one (a short sketch of this calculation follows the list). This version is also capable of dealing with both classification and regression problems.
  • CHAID: This can be considered one of the oldest decision tree algorithms, and it finds the features with the highest statistical significance. CHAID stands for chi-square automatic interaction detection, which is easy to remember because the chi-square test is used to determine the statistical significance of the features. This version is mainly known for solving classification problems.
  • Regression trees: This version of the decision tree is designed mainly for regression. As we have seen above, the other versions are good at classification and none of them was specially designed for regression. The approach of this tree is to turn continuous features into categorical ones, splitting them where they show the highest information gain.
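
As mentioned under CART, the Gini index subtracts the sum of squared class probabilities from one. Below is a minimal sketch of that calculation, independent of ChefBoost and using made-up labels for illustration.

import pandas as pd

def gini_index(labels):
    # Gini impurity: 1 - sum of squared class probabilities
    probabilities = labels.value_counts(normalize=True)
    return float(1 - (probabilities ** 2).sum())

# a pure node has Gini 0, a perfectly mixed binary node has Gini 0.5
print(gini_index(pd.Series(['Yes', 'Yes', 'Yes'])))       # 0.0
print(gini_index(pd.Series(['Yes', 'No', 'Yes', 'No'])))  # 0.5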

Here we have discussed the different versions of decision trees. What makes them even more valuable is that advanced techniques such as gradient boosting, random forest and AdaBoost build on decision trees to get better results. Let's get an idea of how we can implement decision trees.

About ChefBoost Framework

Before writing this article, we went through various options and found that ChefBoost is a Python package providing functions for all the regular decision tree algorithms and the advanced techniques built on them. What is most noticeable about the package is that any of the versions described above can be built in just a few lines of code.

This package accepts data as a pandas dataframe, which makes the process easier for people who already use pandas for data preprocessing. The functions are designed so that by simply passing a pandas dataframe and the type of decision tree we can build a model. The naming convention for the trees is the same simple one used in the section above; for example, to implement ID3 we just need to pass 'ID3' in the configuration dictionary.

The implementation we are going to perform in this article will show how the package is used. We can install it with the following line of code.

!pip install chefboost

Implementing all decision tree algorithms with ChefBoost

There are a variety of ways to implement a decision tree, so here we will look at how we can implement one using this package.

Let's take a look at how we can build decision trees. We will start by importing the golf data that can be found here.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/serengil/chefboost/master/tests/dataset/golf.txt')
data.head()

Output:

Here we can see that there are four conditions based on which we, or the algorithm, can tell whether to play golf or not. Let's build the model.

# ChefBoost exposes its API through the Chefboost module
from chefboost import Chefboost as chef

config = {'algorithm': 'C4.5'}
model = chef.fit(data, config = config, target_label = 'Decision')

Output:

Here we can see the details we got by modelling the decision tree’s C4.5 version. Let’s make a prediction.

prediction = chef.predict(model, param = data.iloc[0])
prediction

Output:

Here in the modelling code, we can see that we set the algorithm to 'C4.5' to build the C4.5 version of the decision tree. We can simply change it according to our requirements. Let's try one more version to verify this.

config = {'algorithm': 'ID3'}
model = chef.fit(data, config = config, target_label = 'Decision')

Output:

Here in the output, we can see that the ID3 version has been built this time. I have implemented all the above-mentioned decision tree algorithms in this link, which you can go through for more ideas.
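
To compare several versions quickly, a loop like the one below can be used. This is a hedged sketch rather than a definitive recipe: the algorithm name strings ('ID3', 'C4.5', 'CART', 'CHAID') follow ChefBoost's naming as used above (verify them against the version you install), the accuracy computation is a crude training-set check written only for illustration, and the regression-tree version is left out because it expects a numeric target rather than the Yes/No labels of the golf data.

import pandas as pd
from chefboost import Chefboost as chef

data = pd.read_csv('https://raw.githubusercontent.com/serengil/chefboost/master/tests/dataset/golf.txt')

for algorithm in ['ID3', 'C4.5', 'CART', 'CHAID']:
    config = {'algorithm': algorithm}
    # fit builds the tree and prints its own summary; the dataframe is copied
    # so that one run does not interfere with the next
    model = chef.fit(data.copy(), config = config, target_label = 'Decision')

    # crude training-set check: predict every row and compare with its label
    predictions = [chef.predict(model, param = data.iloc[i]) for i in range(len(data))]
    accuracy = sum(p == t for p, t in zip(predictions, data['Decision'])) / len(data)
    print(algorithm, 'training accuracy:', round(accuracy, 2))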

Final words 

In this article, we have discussed the decision tree which is a supervised machine learning algorithm that can be used for regression and classification problems and has various versions of itself. Along with this, we have looked at the idea of implementing all versions of decision trees using a single framework named ChefBoost.
