Hands-on Tutorial On Automatic Machine Learning With H2O.ai and AutoML

This article introduces H2O.ai, surveys the features and capabilities of its platform, and demonstrates AutoML for model development and prediction.

As machine learning and artificial intelligence advance to solve a growing range of problems, the number of tools available for developing robust models has surged. Developers often struggle to decide which tool to choose and spend a lot of time understanding the compatibility and features of these tools. In addition, there is a gap between the supply of and demand for data science skills. To address these problems, organizations are building frameworks that automatically process a dataset and build a baseline model. One such organization is H2O.ai.

In this article, we will look at:

  1. What is H2O.ai?
  2. Features and capabilities of H2O.ai
  3. Demonstration of AutoML in model development and prediction using H2O.ai

What is H2O.ai?

The company aims to create open-source machine learning products to make machine learning accessible and allow users to extract insights from data, without needing expertise in deploying or tuning machine learning models.

It provides a range of products, including:

  • H2O: This is an open-source, in-memory, distributed machine learning platform for building supervised and unsupervised machine learning models. It also includes a user-friendly UI called Flow where you can create these models.
  • Sparkling Water: This platform integrates Spark and H2O so that existing users of the Spark ecosystem can build their models.
  • Deep Water: Deep Water integrates H2O with TensorFlow, Caffe and MXNet. It is used for GPU-based models in deep learning and reinforcement learning.
  • Steam: Steam is an enterprise product that allows you to deploy the models you build. You can also expose a trained model as an API for others to access.

Features and capabilities

The H2O platform has a set of different features and capabilities that are discussed below. 

  1. Clusters: H2O runs inside a Java virtual machine (JVM) and performs parallel computations for machine learning on clusters. A cluster consists of one or more nodes, and it can be launched on your laptop, on a server, or across multiple machines when more than one node is used. Data is held in memory in a compressed columnar format, which allows it to be read in parallel. A cluster's memory capacity is the sum of the memory across all H2O nodes in the cluster. This design uses the CPU efficiently and scales flexibly, and it is a large part of what makes H2O fast.
  2. Flow: Flow is an interactive, web-based user interface that allows you to execute code, write text and plot graphs. It is similar to notebooks such as Jupyter Notebook or Google Colaboratory. What sets it apart is that output is not shown only as plain text: you can point, click and interact with objects presented as tabular data. Flow runs on your localhost. Flow supports the REST API, R scripts and CoffeeScript, and no programming experience is required to use it, since you can click your way through any H2O operation without writing a single line of code.
  3. AutoML: AutoML is designed to have as few parameters as possible, so that all the user has to do is upload a dataset, identify the feature columns and the target column, and set a limit on the number of models to train. Everything else is automated with the goal of finding the best-fitting model for that dataset. AutoML can also build and select high-accuracy ensemble models.

Demonstration of AutoML

We saw that H2O provides a lot of unique and out of the box capabilities to achieve faster and more efficient modelling. Let us now look at a hands-on demonstration on how to build a model using AutoML. 

If you are using Flow, you can just download it and start working. Since I am using a Jupyter notebook, I need to install the h2o package before coding.

!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

To keep this demonstration simple, we will build a classification model for detecting diabetes. Download the dataset, then upload it to your drive or working directory.

Creating a cluster 

Every model in the H2O environment runs on a cluster. To create a new cluster, import the package and initialise it:

import h2o

h2o.init()

Calling h2o.init() connects to a running cluster if one exists and otherwise starts a new local one.
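If the default resources are not suitable, they can be set when the cluster is started. The sketch below assumes a single local node and builds the `max_mem_size` and `nthreads` arguments accepted by `h2o.init()`; the helper name `cluster_options` and the 2 GB figure are illustrative choices, not values from this tutorial.

```python
def cluster_options(memory_gb=2, threads=-1):
    """Build keyword arguments for h2o.init() on a single local node.

    memory_gb and threads are hypothetical defaults for illustration;
    threads=-1 asks H2O to use all available cores.
    """
    if memory_gb <= 0:
        raise ValueError("memory_gb must be positive")
    return {"max_mem_size": f"{memory_gb}G", "nthreads": threads}


options = cluster_options(memory_gb=2)
print(options)

# With the h2o package installed, the options would be passed like this:
#   import h2o
#   h2o.init(**options)
```

Keeping the options in a small helper makes it easy to reuse the same settings across notebooks.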

After the cluster has been created, let us now load our data and start AutoML.

diabetes_data = h2o.import_file("diabetes.csv")



The describe() method gives a summary of each column, including data types, missing values and other attribute information.

diabetes_data.describe()

Note: H2O chooses between regression and classification based on the type of the target column, and a numeric target is treated as a regression problem unless you specify otherwise. To make sure this problem is treated as classification, convert the data type of your target to enum (categorical) using the following command.

diabetes_data['Outcome'] = diabetes_data['Outcome'].asfactor()
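It can be worth confirming that the conversion took effect before training. The sketch below relies on the `types` attribute of an H2OFrame, which maps column names to type names; the helper `is_classification_target` is a hypothetical name introduced here for illustration.

```python
def is_classification_target(frame, target):
    """Return True if the target column is categorical (H2O type 'enum').

    `frame` is expected to expose a `types` mapping of column name to
    type name, as an H2OFrame does.
    """
    return frame.types[target] == "enum"


# Assumed usage with the frame from this tutorial:
#   is_classification_target(diabetes_data, "Outcome")
```

If the check returns False, AutoML would treat the target as numeric and fit regression models.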

Let us now split the data into training and test sets. I will use 80% of the data for training and 20% for testing. The split is done with the split_frame() method.

diabetes_split = diabetes_data.split_frame(ratios = [0.8])

db_train = diabetes_split[0] 

db_test = diabetes_split[1] 
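The split above is random, so each run yields different rows in each set. A minimal sketch of a reproducible version follows, assuming the `seed` argument of `split_frame()`; the wrapper `split_train_test` and its default values are illustrative. Note that H2O assigns rows probabilistically, so the resulting proportions are approximate rather than exact.

```python
def split_train_test(frame, train_ratio=0.8, seed=1):
    """Split a frame into (train, test) with a fixed seed for repeatability.

    `frame` is expected to provide split_frame(ratios=..., seed=...),
    as an H2OFrame does.
    """
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be between 0 and 1")
    train, test = frame.split_frame(ratios=[train_ratio], seed=seed)
    return train, test


# Assumed usage with the frame from this tutorial:
#   db_train, db_test = split_train_test(diabetes_data)
```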

The next step is to assign the feature column names and the target column name to variables.

x = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

y = 'Outcome'
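Typing the feature names by hand is error-prone for wider datasets. As a sketch, the list can be derived from the column names instead; the helper `feature_columns` is a hypothetical name, and the column list below simply repeats the names used in this tutorial.

```python
def feature_columns(columns, target):
    """Return every column name except the target, preserving order."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    return [name for name in columns if name != target]


columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
features = feature_columns(columns, 'Outcome')
print(features)

# With an H2OFrame, the names could come from the frame itself:
#   features = feature_columns(diabetes_data.columns, 'Outcome')
```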


Our data is ready, so it is time to pass it to AutoML. AutoML trains a set of models and ranks them on a leaderboard, showing which ones worked best.

from h2o.automl import H2OAutoML

automl = H2OAutoML(max_models = 30, max_runtime_secs=300, seed = 1)

automl.train(x = x, y = y, training_frame = db_train)

leaderboard = automl.leaderboard

leaderboard.head()
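Beyond printing the leaderboard, a short summary of the run can be handy. The sketch below assumes the `leaderboard` and `leader` attributes of a trained `H2OAutoML` object, together with the `nrows` and `columns` attributes of the leaderboard frame; the function name `summarize_automl` is introduced here for illustration.

```python
def summarize_automl(aml):
    """Collect a few facts about a finished AutoML run.

    `aml` is expected to expose `leaderboard` (a frame with `nrows` and
    `columns`) and `leader` (a model with `model_id`), as H2OAutoML does.
    """
    board = aml.leaderboard
    return {
        "models_ranked": board.nrows,      # one leaderboard row per model
        "leader_id": aml.leader.model_id,  # identifier of the top model
        "columns": list(board.columns),    # model_id plus metric columns
    }


# Assumed usage after training:
#   print(summarize_automl(automl))
```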




The leaderboard shows that a stacked ensemble model ranks first (for binary classification, the leaderboard is sorted by AUC by default). Let us make predictions on the test data to check that the model is working correctly.

Making Predictions

predictions = automl.predict(db_test[:-1])
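One way to turn predictions into an accuracy figure is to compare predicted and actual labels row by row. The sketch below uses plain Python lists with made-up label values; in H2O, the predicted class is found in the `predict` column of the frame returned by `predict()`, and the helper name `accuracy` is an illustrative choice.

```python
def accuracy(predicted, actual):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, for illustration only.
predicted_labels = ["1", "0", "0", "1"]
actual_labels = ["1", "0", "1", "1"]
print(accuracy(predicted_labels, actual_labels))  # 0.75
```

For a fuller picture than accuracy alone, H2O models also offer `model_performance()` on a test frame.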



An accuracy of 76% is reasonable considering that we have not pre-processed the dataset or performed any feature engineering on it. The model can be saved as follows.

h2o.save_model(automl.leader, path = "your_directory_path")
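A saved model is only useful if it can be loaded again. The sketch below pairs `h2o.save_model()`, which returns the path of the saved file, with `h2o.load_model()`; the wrapper name `save_and_reload` is hypothetical, and the directory is whatever path you choose. Binary models saved this way are generally expected to be loaded with the same H2O version that produced them.

```python
def save_and_reload(model, directory):
    """Save a model to `directory` and load it back from the returned path."""
    # Imported inside the function so that defining it does not require
    # the h2o package or a running cluster.
    import h2o

    # force=True overwrites an existing file with the same model id.
    saved_path = h2o.save_model(model=model, path=directory, force=True)
    return h2o.load_model(saved_path)


# Assumed usage after training:
#   restored = save_and_reload(automl.leader, "your_directory_path")
```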

Since all of these models are developed on a cluster, you need to shut the cluster down to release memory and dependencies, using

h2o.cluster().shutdown()

H2O's effort to make ML easy for everyone and to democratize AI is advancing at a rapid pace. Tools like these can help bridge the gap between the supply of and demand for machine learning engineers. The H2O platforms give developers a powerful way to explore multiple techniques and build models in a short period of time.

Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
