A Beginner’s Guide to Hoeffding Tree with Python Implementation

A Hoeffding tree is an incremental decision tree that is capable of learning from the data streams. The basic assumption about the data is that data is not changing over time helps in building a Hoeffding tree

The traditional pattern of machine learning starts with data processing and ends with data modelling. The model gives us the results according to satisfaction. But the crucial problem with the traditional procedure is the shortage of storage and time. Hence the newer technologies such as online machine learning are introduced where we can use conventional models in the data stream to obtain satisfactory results. In this article, we are going to discuss a model called  Hoeffding Tree which is based on the conventional decision tree designed for use in online machine learning. It outperforms other machine learning models while working with large data streams by assuming that the data distribution is not changing over time. The major points to be covered in this article are listed below. 

Table of Contents 

  1. Online Machine Learning
  2. Incremental Machine Learning
  3. Incremental Decision Tree
  4. Data Stream
  5. Hoeffding Tree
  6. Implementing Hoeffding Tree
    1. Scikit-MultiFlow
    2. StreamDM

Let us start our discussion by understanding the online machine learning that is the basis of our main discussion. 

Subscribe to our Newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Online Machine Learning

Online machine learning is required when the data is coming to an algorithm in a sequential way and the machine learning algorithms need to use the data to update the best predictor so that the upcoming sequence of the data can be treated wisely at each upcoming step. Unlike other batch learning programs the online machine learning algorithms are mainly focused on the areas where models are computationally infeasible to train on the whole dataset and there is always a need for external memory algorithms. Online machine learning algorithms are prone to dynamically adapt to the new pattern in the data. 

Most of the time online machine learning algorithms are made such that they can forget the previously learned information so that they can be trained over the new information. This procedure of online machine learning can be addressed by incremental machine learning. These algorithms are majorly used in the domains like stock market prediction, weather forecasting etc. In the next section of this article, we are going to discuss incremental machine learning. There are a lot of points about online machine learning most of them are mentioned and explained in this article.  

Incremental Machine Learning

This is another type of machine learning which is basically focused on the learning of the model based on the continuous incoming data. Also using this method of learning we can increase the existing model knowledge. More formally we can consider it as a dynamic technique of supervised or unsupervised learning. These techniques are used with data that are available gradually over time. There are many conventional machine learning algorithms that can be used as incremental machine learning. We can also call them incremental algorithms.

The major goal of this type of algorithm is to adapt to the new changes in the data and also keep the old learning in the memory of the model so that there can be fewer chances of failure of the model or we can get higher accuracy in either of the situations. There are domains like big data and data streams where incremental machine learning techniques are frequently used. The stock market trend is an example of a data stream. 

Some of the examples of incremental algorithms are incremental decision tree(IDE4, ID5R), incremental SVM, RBF neural network, etc. since the article is about the Hoeffding Tree, which is also a decision tree so in the next section of the article we are going to cover some basic knowledge about the incremental decision tree.

Incremental Decision Tree

In the field of online machine learning, if any incremental machine learning algorithms which can output a decision tree can be considered as the incremental decision tree. The traditional decision tree methods imply the training of the algorithm using the whole data. Where the incremental decision tree methods are mainly focused on updating the existing tree so that the tree can be used for making predictions from the few instances of the newly arrived data. They are made such that they can be used in situations where enough dataset is not available to train the model but the tree is updated or the size of data is very high and it is varying over time.

In various domains such as online learning, data streams, concept drift these algorithms can be used and are also being used frequently. This is not enough here; these algorithms also can be used in situations where the data can not be modelled using hierarchical models and systems where the output required is interpretable. Most frequent uses of  the incremental decision tree can be seen in the data stream 

Data Stream

The data which is generated continuously in an incremental manner from different sources can be considered as the data stream. Basically, this type of data is generated gradually and flowing across different platforms which are designed as they can give the information to the user as a function of time. Usually, data streams are used in the context of big data where it is generated using different sources either at stable or unstable high speed. To know more about the data stream a reader can go to this link. Where various details of the data stream are mentioned. Since it is not the subject of focus in the article we are not going in the deep a basic level introduction is sufficient to understand the next section of the article. The next section is about the Hoeffding tree which is an incremental decision tree. 

Hoeffding Tree

A Hoeffding tree is an incremental decision tree that is capable of learning from the data streams. The basic assumption about the data is that data is not changing over time helps in building a Hoeffding tree. This is an algorithm that helps in making and growing a decision tree on the basis of the guarantees given by the additive Chernoff bound. Additive Chernoff bound is a probability theory that gives bounds that are exponentially decreasing on the tail distribution of the total of the random variable which is independent variables. These bounds are sharper than the other tail bounds such as Markov inequality. 

Using this probabilistic theory in Hoeffding decision trees any node is expanded as sufficient evidence of an optimal splitting feature is available. More formally saying additive Chernoff bound helps in exploiting the fact where the fact states that a lower amount of sample can give enough evidence to choose an optimal splitting attribute. The additive Chernoff bound can also be considered as the Hoeffding bound which basically gives the values in quantities of observation which can help in estimating any statistical results within a prescribed estimated range of results. 

Hoeffding bound can be considered as the proof for algorithm in incremental machine learning which shows that the output or accuracy of the algorithm is nearly the same as the output or result of the non-incremental or static machine learning algorithms. There can be various ways using which we can implement the Hoeffding tree. Some of the main methods are given below.

Implementing Hoeffding Tree

Let us implement the Hoeffding tree in python. We will go through the different steps as given below.

  • Scikit-MultiFlow

In the field of machine learning, sklearn is a library that gives us most of the algorithms of offline machine learning. When it comes to online machine learning or incremental machine learning we can use the scikit-Multiflow library. Let’s start with the installation of the library specially designed for online machine learning.

We can install the scikit-MultiFlow using the following command line and pip.

!pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow


Some of the basic requirements to use the library are Python 3.5+, NumPy and Cython =< v0.5.0

Importing the libraries:

from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

Making a data stream:

state = 1
data_stream = SEAGenerator(random_state=state)

The module SEAGenerator helps in making a synthetic data stream.

Setting up some of the variables for controlling the procedure:

n_samples = 0
correct_cnt = 0
max_samples = 600

Making the instance of the HT classifier:

ht = HoeffdingTreeClassifier()

Training the model on the data stream:

while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1

Displaying the results:

print('{} samples for analysis.'.format(n_samples))
print('accuracy: {}'.format(correct_cnt / n_samples))


As we can see that we have made a Hoeffding tree model and also we can see i8n the result that is pretty similar to a non-incremental decision tree algorithm results.

  • StreamDM

We can also implement the Hoeffding tree in the apache spark using the StreamDM library which is basically a library for mining big data stream using the spark stream/. Using the StreamDM we can utilize the following models on the data stream.

  • SGD Learner and Perceptron
  • Naive Bayes
  • CluStream
  • Hoeffding Decision Trees
  • Bagging
  • Stream KM++

Also in it, we have some of the data generators which can be used for practice on the model.

  • HyperplaneGenerator
  • RandomTreeGenerator
  • RandomRBFGenerator
  • RandomRBFEventsGenerator

Let’s see how we can use the streamDM for making the Hoeffding Trees.

We can use the class HoeffdingTree which is implemented in the HoeffdingTreeModel. The class is controlled by the following major parameter. 

  • -n: for the moment where only the parameter is supported, only the gaussian parameter.
  • -g: number of examples observed by the leaf before the attempt to split. 
  •  -s: split criterion.
  • -c: allow error in a split decision. 

Final Words 

In the article, we discussed online machine learning and incremental machine learning along with the incremental decision trees that are the foundations of a Hoeffding tree. We have also seen an incremental tree that can be applied to the data stream and also we have gone through some of the major ways to implement a Hoeffding tree in the data stream.


Hoeffding Decision Trees

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.

Download our Mobile App

MachineHack | AI Hackathons, Coding & Learning

Host Hackathons & Recruit Great Data Talent!

AIMResearch Pioneering advanced AI market research

With a decade of experience under our belt, we are transforming how businesses use AI & data-driven insights to succeed.

The Gold Standard for Recognizing Excellence in Data Science and Tech Workplaces

With Best Firm Certification, you can effortlessly delve into the minds of your employees, unveil invaluable perspectives, and gain distinguished acclaim for fostering an exceptional company culture.

AIM Leaders Council

World’s Biggest Community Exclusively For Senior Executives In Data Science And Analytics.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox