AutoGluon is an open-source tool from AWS that supports a variety of AutoML (Automated Machine Learning) tasks. It automates machine learning and deep learning workflows and identifies the most suitable model for a given task. In this post, we discuss AutoGluon and the features it offers for automating machine learning. We walk through tabular prediction with AutoGluon to see how a machine learning task can be automated, and we close by looking at how to find the best model for a task. The major points discussed in this article are listed below.
Table of Contents
- What is AutoML?
- The AutoGluon
- What Can be Done with AutoGluon?
- AutoGluon on Tabular Data
- How to Get an Optimal Model?
Let’s start by understanding what AutoML actually is.
What is AutoML?
Automated machine learning (AutoML) refers to the process of automating the tasks involved in applying machine learning to real-world problems. AutoML covers the whole pipeline, from the raw dataset to the deployable machine learning model. It was proposed as an AI-based solution to the ever-growing challenge of applying machine learning. Because of AutoML’s high level of automation, non-experts can use machine learning models and procedures without first becoming machine learning professionals.
Automating the entire machine learning process has the added benefit of producing simpler solutions, producing them faster, and yielding models that frequently outperform hand-designed ones. AutoML has also been used to compare the relative importance of each factor in a prediction model.
AutoML research has produced a wide range of packages and approaches aimed at both researchers and end users. Several off-the-shelf software packages for automated machine learning have been created in recent years. Some of them are listed below.
- AutoGluon is a multi-layer stacking method for various machine learning models.
- MLBoX is a three-part AutoML toolkit that includes preprocessing, optimization, and prediction.
- AutoWEKA is a method for selecting a machine learning algorithm and its hyperparameters at the same time; when combined with the WEKA package, it produces good models for a wide range of data sets automatically.
- Auto-sklearn is a Python package that extends AutoWEKA and serves as a drop-in replacement for conventional scikit-learn classifiers and regressors.
- Auto-PyTorch is based on the PyTorch deep learning framework and adjusts hyperparameters and neural architecture simultaneously.
The AutoGluon
AutoGluon is an open-source AutoML tool that can train highly accurate machine learning models on unprocessed tabular datasets, such as CSV files, with a single line of Python code. Unlike other AutoML frameworks, which focus largely on model and hyperparameter selection, AutoGluon succeeds by ensembling several models and stacking them in multiple layers. According to its authors' experiments, this multi-layer combination of many models makes better use of the allotted training time than searching for the single best model.
AutoGluon follows these design principles:
- Simplicity. A user can immediately train a model on raw data without knowing anything about the data or ML models.
- Robustness. The framework can handle a wide range of structured datasets and ensures that training continues even if some of the individual machine learning models fail.
- Fault tolerance. Training can be paused and resumed at any point, which is particularly useful on preemptible (spot) cloud instances.
- Predictable timing. Users can specify a time budget within which results are returned.
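To make the robustness and timing principles concrete, here is a minimal sketch in plain Python. It is not AutoGluon's internal code; the function name `train_all` and the toy "trainers" are made up for illustration. It only shows the idea of skipping a model that fails and stopping once a time budget is used up.

```python
import time


def train_all(trainers, time_limit):
    """Run each trainer in turn; skip failures and stop at the deadline.

    `trainers` maps a model name to a zero-argument callable that returns
    a fitted model. Both the name and the shape are illustrative only.
    """
    deadline = time.monotonic() + time_limit
    fitted, skipped = {}, {}
    for name, train in trainers.items():
        if time.monotonic() >= deadline:
            skipped[name] = "out of time"
            continue
        try:
            fitted[name] = train()
        except Exception as exc:
            # One broken model must not abort the whole run.
            skipped[name] = f"failed: {exc}"
    return fitted, skipped


def broken_trainer():
    raise ValueError("bad input")


fitted, skipped = train_all(
    {"a": lambda: "model-a", "b": broken_trainer, "c": lambda: "model-c"},
    time_limit=5.0,
)
print(fitted)
print(skipped)
```

The run finishes with the models that could be trained and a record of why the others were skipped, which is the behaviour the two principles describe.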
AutoGluon enables easy-to-use and extensible AutoML, with a focus on automated stack ensembling, deep learning, and real-world applications spanning text, image, and tabular data. It is designed for both novices and specialists in machine learning and provides features such as:
- With just a few lines of code, we can quickly prototype deep learning and traditional ML solutions for our raw data.
- It automatically employs cutting-edge techniques (when appropriate) without the need for specialist knowledge.
- It supports automatic hyperparameter tuning, model selection/ensembling, architecture search, and data processing.
- We can easily improve or tune our custom models and data pipelines, or tailor AutoGluon to our needs.
What Can be Done with AutoGluon?
With AutoGluon, machine learning developers can accomplish the following tasks:
For tabular prediction, AutoGluon can generate models that predict the values in one column from the values in the other columns of standard datasets represented as tables (usually stored as CSV files). We can obtain excellent accuracy in standard supervised learning tasks such as classification and regression with a single call to fit(). There are also many parameters we can tune to improve performance further. We get there without time-consuming procedures such as data cleaning, feature engineering, rigorous hyperparameter tuning, and algorithm selection.
For image classification, AutoGluon again provides a simple fit() function that automatically generates high-quality models for classifying photos based on their content. A single call to fit() returns an accurate neural network for the image dataset we provide, automatically employing accuracy-enhancing techniques such as transfer learning and hyperparameter optimization on our behalf. We can prepare the dataset from CSV files or organize the data into suitable directories using the library's APIs.
For object detection, AutoGluon provides a simple fit() function for identifying the presence and location of objects in photos, which creates high-quality object detection models automatically. A single call to fit() trains accurate neural networks on the image dataset we provide, automatically employing accuracy-boosting techniques such as transfer learning and hyperparameter tuning.
For text prediction, fit() can likewise be used to generate high-quality models (usually transformer neural networks) automatically for supervised tasks. Each training sample can be a sentence, a brief paragraph, or a combination of several text fields (e.g., for predicting how similar two sentences are), and it can even include numeric or categorical variables in addition to the text. The predicted values can be continuous (regression) or discrete categories (classification).
A single call is all it takes: fit() automatically applies accuracy-boosting approaches, including fine-tuning a pre-trained NLP model and hyperparameter optimization, to train a highly accurate neural network on the input text dataset.
In many applications, text data is mixed with numerical and categorical data. TextPredictor from AutoGluon can train a single neural network that works on several feature types at the same time, such as text, categorical, and numerical columns. The fundamental idea is to process the text, categorical, and numeric fields separately and then combine them across modalities. AutoGluon can be used to address such multimodal tasks. We can also train an ensemble on data such as images together with their associated tabular features.
Let us now look at how AutoGluon can be used for predictions on tabular data. In the following sections, we will see how to perform a classification task with AutoGluon-Tabular.
AutoGluon on Tabular Data
AutoGluon-Tabular is a simple and accurate method for working with tabular data. It is capable of complex data processing, deep learning, and multi-layer model ensembling. It automatically recognizes the data type of each column for comprehensive data preprocessing, including special handling of text fields. AutoGluon supports a wide range of models, from off-the-shelf boosted trees to bespoke neural network models.
These models are ensembled in an innovative way: they are stacked in multiple layers and trained layer by layer, so that raw data can be transformed into high-quality predictions within a specified time limit. Over-fitting is reduced throughout the process by splitting the data in different ways and keeping careful track of out-of-fold examples.
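The role of out-of-fold examples in stacking can be sketched with a toy example in plain Python. This is a simplified illustration, not AutoGluon's implementation: the two base models (a constant mean and a least-squares line), the three contiguous folds, and the grid-searched blending weight are all assumptions chosen to keep the sketch short. The point is that the second layer only ever sees predictions made for rows the base model did not train on.

```python
from statistics import mean


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean(xs, ys):
    """Base model 1: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Base model 2: least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def out_of_fold_predictions(fit, xs, ys, k=3):
    """Predict every row with a model that never saw that row in training."""
    oof = [None] * len(xs)
    for holdout in kfold_indices(len(xs), k):
        held = set(holdout)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        for i in holdout:
            oof[i] = model(xs[i])
    return oof


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]  # toy data following y = 2x + 1

# Layer 1: out-of-fold predictions from each base model.
layer1 = {
    "mean": out_of_fold_predictions(fit_mean, xs, ys),
    "line": out_of_fold_predictions(fit_line, xs, ys),
}


def blend_error(w):
    """Squared error of w * line + (1 - w) * mean against the targets."""
    return sum(
        (w * line_p + (1 - w) * mean_p - y) ** 2
        for line_p, mean_p, y in zip(layer1["line"], layer1["mean"], ys)
    )


# Layer 2: a very small "stacker" that learns one blending weight
# from the layer-1 out-of-fold predictions.
best_weight = min((i / 10 for i in range(11)), key=blend_error)
print("weight on the line model:", best_weight)
```

Because the blending weight is fitted on out-of-fold predictions, a base model that merely memorises its training rows gets no credit for doing so, which is how this scheme helps limit over-fitting.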
The figure below shows AutoGluon’s neural network architecture for tabular data, which takes numerical and categorical features as input. Layers with learnable parameters are shown in blue.
How to Get an Optimal Model?
Consider a structured dataset of raw values saved in a CSV file, such as Stars.csv, with the label values to be predicted stored in a column named ‘Type’. AutoGluon automatically preprocesses the raw data, determines the type of prediction problem (binary classification, multi-class classification, or regression), partitions the data into folds for model training and validation, fits various models individually, and finally creates an optimized model ensemble that outperforms any of the individually trained models.
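To give a feel for how a problem type can be determined from the label column alone, here is a simplified heuristic in plain Python. It is an illustration only and not AutoGluon's actual rule; the function name `infer_problem_type` and the `max_classes` threshold are assumptions.

```python
def infer_problem_type(labels, max_classes=20):
    """Guess the prediction problem from the label column alone.

    Simplified heuristic for illustration; not AutoGluon's actual rule.
    """
    values = [v for v in labels if v is not None]
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    has_fraction = any(
        isinstance(v, float) and not v.is_integer() for v in values
    )
    # Fractional values or very many distinct numbers suggest a continuous target.
    if all_numeric and (has_fraction or len(distinct) > max_classes):
        return "regression"
    return "multiclass"


print(infer_problem_type([0, 1, 1, 0]))              # two distinct values
print(infer_problem_type([0, 1, 2, 3, 4, 5, 0, 1]))  # a few integer class codes
print(infer_problem_type([0.5, 1.7, 2.2, 3.9]))      # fractional targets
```

After training, the problem type AutoGluon actually settled on can be read from the predictor's `problem_type` attribute.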
fit() also accepts extra hyperparameters for users who are prepared to accept longer training times in order to maximize predictive accuracy. All intermediate results are saved to disk. If a call is cancelled, training can be resumed through an option to fit().
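As a rough sketch of trading training time for accuracy, the helper below passes a time budget and a quality preset to fit(). It assumes a recent AutoGluon release in which `TabularPredictor.fit()` accepts `time_limit` (in seconds) and `presets`; the helper names, the default budget, and the `ag_models` directory are hypothetical. The second helper only reloads a finished predictor from disk; it does not resume an interrupted run.

```python
def fit_with_budget(train_df, label, seconds=600, save_dir="ag_models"):
    """Train with a wall-clock budget and a higher-accuracy preset."""
    # Imported lazily so this file loads even where AutoGluon is not installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, path=save_dir)
    # time_limit is in seconds; "best_quality" favours accuracy over speed.
    predictor.fit(train_df, time_limit=seconds, presets="best_quality")
    return predictor


def reload_predictor(save_dir="ag_models"):
    """Load a previously trained predictor back from its save directory."""
    from autogluon.tabular import TabularPredictor

    return TabularPredictor.load(save_dir)
```

With the Stars data, a call might look like `fit_with_budget(train, label='Type', seconds=300)`, where the 300-second budget is an arbitrary choice.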
Let’s implement it. The dataset used for the experiment is taken from this Kaggle repository and is about predicting the type of a star from 6 attributes.
At the time of writing, AutoGluon is not officially supported on Windows; it is available for Linux and macOS. To get started, we install it with:
pip install mxnet autogluon
import pandas as pd
from sklearn.model_selection import train_test_split
# load the autogluon predictor
from autogluon.tabular import TabularPredictor

df = pd.read_csv('Stars.csv')
df.head()
Look at the data carefully: we are going to feed it to AutoGluon as it is, without preprocessing steps such as encoding the two categorical variables. This is one of the beauties of AutoGluon.
Now let us split the data into train and test sets. After that, we are ready to train all the models available in AutoGluon.
# split into train and test
train, test = train_test_split(df, random_state=42, test_size=0.3)
y_test = test['Type']
test_nolab = test.drop(['Type'], axis=1)
# train models
predictor = TabularPredictor(label='Type').fit(train)
Now we can check the probability the top classifier assigns to each class, and we can also look at the leaderboard of the models.
pred_probs = predictor.predict_proba(test_nolab)
pred_probs.head(5)
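Class probabilities become predicted labels by taking the most probable class in each row. The sketch below shows that step, plus a simple accuracy check, in plain Python. The probabilities and true labels are made up for illustration and are not output from the Stars model; the helper names are hypothetical.

```python
def top_class(prob_row):
    """Return the (label, probability) pair with the highest probability."""
    return max(prob_row.items(), key=lambda kv: kv[1])


def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up probabilities for three rows and three classes.
rows = [
    {0: 0.90, 1: 0.07, 2: 0.03},
    {0: 0.10, 1: 0.25, 2: 0.65},
    {0: 0.20, 1: 0.70, 2: 0.10},
]
y_true = [0, 2, 0]  # made-up true labels

y_pred = [top_class(row)[0] for row in rows]
print(y_pred, accuracy(y_true, y_pred))
```

For a real probability table held in a pandas DataFrame, rows of this shape can be obtained with `pred_probs.to_dict("records")`, and the true labels would come from `y_test`.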
# Leaderboard for the trained model
predictor.leaderboard(test, silent=True)
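One plausible way to pick a winner from a leaderboard is to rank by test score and, when scores tie, prefer the model that predicts faster. The sketch below does this on placeholder rows in plain Python. The model names and numbers are invented and are not the results of the run above; the keys `score_test` and `pred_time_test` mirror the column names we would expect in the leaderboard table, and whether AutoGluon breaks ties in exactly this way is an assumption.

```python
# Placeholder leaderboard rows; names and numbers are invented, not results.
rows = [
    {"model": "model_a", "score_test": 0.97, "pred_time_test": 0.040},
    {"model": "model_b", "score_test": 1.00, "pred_time_test": 0.012},
    {"model": "model_c", "score_test": 1.00, "pred_time_test": 0.003},
]


def rank_models(rows):
    """Best score first; among equal scores, the faster predictor wins."""
    return sorted(rows, key=lambda r: (-r["score_test"], r["pred_time_test"]))


ranked = rank_models(rows)
best = ranked[0]["model"]
print([r["model"] for r in ranked], "best:", best)
```

Here model_b and model_c tie on score, so the lower prediction time decides the order.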
The leaderboard shows that, of the 13 models trained, CatBoost was the top performer, although it was almost tied with LightGBM; CatBoost came out ahead because it needed less prediction time.
In this article, we introduced AutoGluon, an AutoML library that automates tasks such as classification and regression on tabular data, object detection, text classification, and image classification.