Automated machine learning (AutoML) can be a huge time saver, especially when the dataset is large or the task is a standard classification or regression problem. One open-source AutoML tool is AutoSklearn (auto-sklearn). The sklearn library is very widely used for building machine learning models, but with sklearn it is up to the user to choose the algorithm and tune its hyperparameters. With autosklearn, these steps are automated. Along with data preparation and model building, it also learns from models that have been used on similar datasets and can automatically build ensemble models for better accuracy.
In this article, we will see how to make use of autosklearn for classification and regression problems.
Installing the package
Before we build models with autosklearn, we need to install the package in our working environment. On a Linux operating system, this can be done with pip.
pip3 install auto-sklearn
However, if you are making use of Colab you will need to install the following:
!sudo apt-get install build-essential swig
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn
This will install the library and we can move to the next step.
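Before moving on, it can help to confirm that the package is visible to the interpreter you are running. The following is a minimal sketch using only the standard library; the helper name is_installed is my own and not part of auto-sklearn. Note that the pip package is called auto-sklearn while the import name is autosklearn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# The pip package is called auto-sklearn, but the import name is autosklearn
if is_installed("autosklearn"):
    print("auto-sklearn is available")
else:
    print("auto-sklearn is not installed in this environment")
```

If the second message appears in Colab, re-running the install commands and restarting the runtime is usually the first thing to try.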
AutoSklearn for classification problems
Now that we have everything needed to start, we can build a model using autosklearn on a classification problem. For this type of problem, we use the class called AutoSklearnClassifier. Let us first select the dataset and then proceed with the model.
I will use a simple wine quality dataset from the UCI repository. The code below reads it directly from a CSV file hosted on GitHub, so you can work with the same dataset. Now let us load the dataset.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from autosklearn.classification import AutoSklearnClassifier

# Load the red wine quality data into a DataFrame
wine = pd.read_csv('https://raw.githubusercontent.com/sharmaroshan/Wine-Quality-Predictions/master/winequality-red.csv')
wine
Splitting the dataset
Now, let us separate the dataset into features and target, and then split both into training and test sets.
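# Features are all columns except the last; the target is the last column
dataset = wine.values
ft, target = dataset[:, :-1], dataset[:, -1]
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(ft, target, test_size=0.2, random_state=1)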
Building the classification model
Since we are using auto-sklearn, we need not specify the name of the algorithm or the parameters. These are done automatically for us and the final result is displayed.
# Search for up to 2 minutes (the budget is given in seconds)
autosk = AutoSklearnClassifier(time_left_for_this_task=60*2)
autosk.fit(X_train, y_train)
print(autosk.sprint_statistics())
time_left_for_this_task is the total time budget, in seconds, that the user allows for searching for suitable models. I have allowed the search to run for two minutes, but you can choose any amount of time you wish.
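The overall budget is often paired with a cap on each individual model fit, so that one slow candidate cannot use up the whole search. Below is a sketch of how such a budget might be put together. per_run_time_limit and seed are existing constructor arguments of AutoSklearnClassifier, while the helper name make_budget and the one-quarter ratio are assumptions made purely for illustration, not library defaults.

```python
def make_budget(total_minutes, per_run_fraction=0.25):
    """Build time-budget keyword arguments (in seconds) for auto-sklearn.

    per_run_fraction is an illustrative assumption, not a library default.
    """
    total_seconds = int(total_minutes * 60)
    per_run_seconds = max(1, int(total_seconds * per_run_fraction))
    return {
        "time_left_for_this_task": total_seconds,
        "per_run_time_limit": per_run_seconds,
    }


budget = make_budget(2)
print(budget)  # {'time_left_for_this_task': 120, 'per_run_time_limit': 30}

# With auto-sklearn installed, the dictionary can be unpacked into the constructor:
# from autosklearn.classification import AutoSklearnClassifier
# autosk = AutoSklearnClassifier(seed=1, **budget)
```

Keeping the budget in one place makes it easy to rerun the same search with a longer limit and compare the results.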
Now we have the statistics of the model, which show that 21 algorithms were checked during the search. Let us now see the accuracy of the model.
# accuracy_score was imported directly from sklearn.metrics above
pred = autosk.predict(X_test)
print("Accuracy score", accuracy_score(y_test, pred))
This is a good score, given that we have not scaled or pre-processed the data and have allowed the search to run for only 2 minutes. Thus, we have built a classification model using autosklearn.
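Whether an accuracy figure counts as good depends on how unbalanced the classes are. One rough check is to compare it with the accuracy of always predicting the most frequent class. The sketch below uses only the standard library and made-up labels, not the wine data, and the function names are my own.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [most_common_label] * len(y_test))


# Toy quality labels, invented for illustration
toy_train = [5, 5, 6, 5, 6, 7, 5, 6]
toy_test = [5, 6, 5, 7]
toy_pred = [5, 6, 6, 7]

print("model accuracy:", accuracy(toy_test, toy_pred))
print("baseline accuracy:", majority_baseline(toy_train, toy_test))
```

In the article's setting, the same two functions could be called with the real y_train, y_test and pred to see how far the model is above the baseline.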
Autosklearn for regression
We have already seen how autosklearn works for classification type of models. Next, let us implement this for a regression problem and check the results.
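For this, I will use the Boston housing dataset that is built into sklearn. The task here is to predict the price of houses in Boston using the features given. Note that load_boston was removed in scikit-learn 1.2, so the code below requires an older version of scikit-learn. Let us now load the dataset.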
from sklearn.datasets import load_boston
import pandas as pd

boston_data = load_boston()
# Put features and target into DataFrames, then join them for inspection
features = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
target = pd.DataFrame(boston_data.target, columns=['TARGET'])
dataset = pd.concat([features, target], axis=1)
Splitting the dataset
Let us split this dataset into train and test data using the train_test_split function of sklearn.
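The split itself is not shown in the text. A minimal sketch that produces the variable names used in the later code (xtrain, xtest, ytrain, ytest) is given here. The 80/20 ratio and the seed of 1 are assumptions carried over from the classification example, and the helper name split_regression_data is hypothetical.

```python
def split_regression_data(features, target, test_size=0.2, seed=1):
    """Split features and target into train and test parts.

    test_size and seed are assumed values mirroring the classification example.
    Returns xtrain, xtest, ytrain, ytest in that order.
    """
    # Imported here so the helper can be defined even before scikit-learn is installed
    from sklearn.model_selection import train_test_split

    return train_test_split(features, target, test_size=test_size, random_state=seed)


# Usage with the features and target DataFrames built earlier:
# xtrain, xtest, ytrain, ytest = split_regression_data(features, target)
```

Fixing the seed keeps the split reproducible, which matters when comparing runs with different time budgets.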
Just as we used AutoSklearnClassifier for classification, we will use AutoSklearnRegressor for regression models.
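from autosklearn.regression import AutoSklearnRegressor

# Allow the search to run for up to 5 minutes (the budget is given in seconds)
regressor = AutoSklearnRegressor(time_left_for_this_task=60*5)
regressor.fit(xtrain, ytrain)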
Here I have given the time as 5 minutes to see the impact on the results.
Now, let us see the statistics of the model along with the error rate. Since this is a regression problem we will use the mean absolute error as the metric.
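from sklearn.metrics import mean_absolute_error

print(regressor.sprint_statistics())
# Predict on the held-out data with the fitted regressor
pred = regressor.predict(xtest)
mae = mean_absolute_error(ytest, pred)
print("MAE:", mae)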
This shows that the error is low, which means the model has performed well. The statistics also show a validation score of 0.86. For regression, auto-sklearn's default metric is R², so this is a measure of fit rather than an accuracy, and 0.86 is a good value. As we can see, the search checked 57 algorithms in the 5 minutes.
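Mean absolute error is the average size of the prediction errors, expressed in the same units as the target. For reference, the calculation can be sketched by hand as follows. The numbers are made up for illustration and are not the Boston results, and the function name is my own rather than sklearn's.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented house prices and predictions, for illustration only
actual = [24.0, 21.5, 30.0, 18.0]
predicted = [22.0, 23.0, 29.0, 18.5]

print("MAE:", mean_absolute_error_manual(actual, predicted))  # MAE: 1.25
```

Because the error is in the target's own units, it is easiest to judge next to the typical range of the prices being predicted.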
In this article, we saw how to use autosklearn and build both classification and regression models without having to specify the name of the algorithm. We achieved good results in both of these models. AutoSklearn can be really useful in business analytics and research to build faster and better models.