As machine learning and artificial intelligence advance to solve a plethora of problems, there is a surge in the number of tools available for building robust models. Developers often struggle to decide which tool to choose and spend a lot of time understanding the compatibility and features of these tools. On top of this, there is a gap between the supply of and demand for data science skills. To address these problems, organizations are building frameworks that automatically process a dataset and build a baseline model. One such organization is H2O.ai.
In this article, we will look at:
- Who is H2O.ai?
- Features and capabilities of H2O.ai
- Demonstration of AutoML in model development and prediction using H2O.ai
The company aims to create open-source machine learning products to make machine learning accessible and allow users to extract insights from data, without needing expertise in deploying or tuning machine learning models.
They provide a range of products:
- H2O: an open-source, in-memory, distributed machine learning platform for building supervised and unsupervised models. It also includes a user-friendly UI called Flow where you can create these models.
- Sparkling Water: an integration of Spark and H2O that lets existing Spark ecosystem users build their models.
- Deep Water: an integration of H2O with TensorFlow, Caffe and MXNet, used for GPU-based deep learning and reinforcement learning models.
- Steam: an enterprise product that lets you deploy the models you build. You can also expose a trained model as an API for others to access.
Features and capabilities
The H2O platform has a set of different features and capabilities that are discussed below.
- Clusters: H2O runs as a Java Virtual Machine process capable of performing parallel machine learning computations on clusters. A cluster consists of one or more nodes and can be launched on your laptop, on a server, or across multiple machines when more than one node is used. Data is held in memory in a compressed columnar format, allowing it to be read in parallel. A cluster's memory capacity is the sum of the memory across all H2O nodes in the cluster. This design is not only efficient for the CPU but also gives great flexibility when scaling a model; these clusters are what make H2O fast.
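To see why column-wise storage helps, here is a toy, pure-Python sketch of row storage versus column storage (an illustration of the idea only, not H2O's actual implementation): a column-wise aggregate only has to scan the one contiguous list it needs, which is also what makes per-column compression and parallel reads practical.

```python
# Row store: each record is a dict; an aggregate must touch every record.
row_store = [
    {"glucose": 148, "bmi": 33.6},
    {"glucose": 85,  "bmi": 26.6},
    {"glucose": 183, "bmi": 23.3},
]

# Column store: one contiguous list per attribute.
col_store = {
    "glucose": [148, 85, 183],
    "bmi":     [33.6, 26.6, 23.3],
}

# A per-column aggregate reads only the single list it needs.
mean_glucose = sum(col_store["glucose"]) / len(col_store["glucose"])
print(mean_glucose)
```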
- Flow: Flow is an interactive user interface that lets you execute code, write text and plot graphs. It is similar to notebooks such as Jupyter or Google Colaboratory. What makes it unique is that it does not display output only as plain text: you can point, click and interact with objects in the form of tabular data. Flow runs on your localhost and supports the REST API, R scripts and CoffeeScript, but no programming experience is required to use it; you can click your way through any H2O operation without writing a single line of code.
- AutoML: Automatic Machine Learning is designed to need as few parameters as possible: all the user has to do is upload a dataset, identify the features and the prediction target, and set the number of models to train. Every other step is automated, with the goal of finding the best-fitting model for that dataset. AutoML is also known for building high-accuracy stacked ensemble models.
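The stacked-ensemble idea can be sketched in a few lines of plain Python. This is a toy illustration with hypothetical base models, not H2O's actual algorithm: a meta-learner combines base-model predictions, and where H2O learns the combination with a metalearner, the sketch below simply uses fixed weights.

```python
def base_model_a(x):
    # Hypothetical base learner: crude threshold rule.
    return 0.9 if x > 120 else 0.2

def base_model_b(x):
    # A second hypothetical base learner with a different threshold.
    return 0.7 if x > 100 else 0.3

def stacked_predict(x, w_a=0.6, w_b=0.4):
    # Combine base predictions with fixed weights; H2O instead learns
    # these weights with a metalearner trained on base-model outputs.
    return w_a * base_model_a(x) + w_b * base_model_b(x)

print(stacked_predict(150))  # 0.6*0.9 + 0.4*0.7 = 0.82
```

The point is that the combined prediction can outperform any single base model, which is why AutoML's leaderboards are so often topped by stacked ensembles.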
Demonstration of AutoML
We have seen that H2O provides many unique, out-of-the-box capabilities for faster and more efficient modelling. Let us now walk through a hands-on demonstration of building a model with AutoML.
If you are using Flow, you can simply download it and begin working. Since I am using a Jupyter notebook, I need to install the h2o package before coding.
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
To keep this demonstration simple, we will build a classification model for detecting diabetes using the Pima Indians Diabetes dataset. Once downloaded, upload it to your drive or working directory.
Creating a cluster
Every model in the H2O environment runs on a cluster. To create a new cluster, import the package and initialize it:
import h2o
h2o.init()
Doing this will create a new local cluster (or connect to one that is already running).
After the cluster has been created, let us now load our data and start AutoML.
diabetes_data = h2o.import_file("diabetes.csv")
The describe function, called as diabetes_data.describe(), gives us a description of data types, missing values and other attribute information.
Note: AutoML treats any problem with a numeric target as a regression problem unless told otherwise. Since our Outcome column is numeric (0/1), convert its data type to enum (categorical) using the following command so that the task is treated as classification.
diabetes_data['Outcome'] = diabetes_data['Outcome'].asfactor()
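The pitfall above can be sketched with toy type-inference logic (an illustration only, not H2O's parser): a 0/1 target parses as numeric, so an automatic tool defaults to regression unless the column is recast as categorical, which is exactly what asfactor() does.

```python
def inferred_task(column):
    # Toy rule: an all-numeric target column looks like regression.
    if all(isinstance(v, (int, float)) for v in column):
        return "regression"
    # A categorical (non-numeric) target reads as classification.
    return "classification"

outcome = [1, 0, 0, 1, 1]
print(inferred_task(outcome))                    # regression
print(inferred_task([str(v) for v in outcome]))  # classification
```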
Let us now split the data into train and test sets. I will use 80% of the data for training and 20% for testing. Data splitting is done using the split_frame() method.
diabetes_split = diabetes_data.split_frame(ratios = [0.8])
db_train = diabetes_split[0]
db_test = diabetes_split[1]
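Note that split_frame gives an approximate split, not an exact one: each row is independently assigned to the training set with probability 0.8. A pure-Python sketch of that behaviour (assuming the Pima diabetes dataset's 768 rows):

```python
import random

# Each row lands in the training split with probability 0.8, so the
# resulting 80/20 proportions are approximate, mirroring split_frame.
random.seed(1)
rows = list(range(768))  # the Pima diabetes dataset has 768 rows
train = [r for r in rows if random.random() < 0.8]
train_set = set(train)
test = [r for r in rows if r not in train_set]
print(len(train), len(test))
```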
The next step is to assign the feature and target column names to variables.
x = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
y = 'Outcome'
All our data is ready, so it is time to pass it to the AutoML function. AutoML produces a leaderboard of all the models it ran, ranked by performance.
from h2o.automl import H2OAutoML
automl = H2OAutoML(max_models = 30, max_runtime_secs = 300, seed = 1)
automl.train(x = x, y = y, training_frame = db_train)
leaderboard = automl.leaderboard
The leaderboard shows that a stacked ensemble model performs best. Let us make a prediction on the test data to check whether the model is working correctly.
predictions = automl.predict(db_test)
76% accuracy is respectable considering that we have not pre-processed the data or performed any feature engineering. The model can be saved as follows.
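For reference, the accuracy figure is simply the fraction of correct predictions. A minimal pure-Python sketch with hypothetical predicted and actual labels (toy values chosen for illustration, not the model's real output):

```python
# Hypothetical predicted vs. actual class labels.
predicted = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
actual    = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Accuracy = correct predictions / total predictions.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # 7 of 10 toy labels match, so 0.7
```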
h2o.save_model(automl.leader, path = "your_directory_path")
Since all of this runs on a cluster, you need to shut the cluster down to release its memory and dependencies:
h2o.cluster().shutdown()
H2O's mission to make machine learning easy for everyone and to democratize AI is advancing at a rapid pace. Tools like these help bridge the gap between the supply of and demand for machine learning engineers. The H2O platforms are powerful ways for developers to explore multiple techniques and build models in a short period of time.