Building An ML Classification Model Using PyCaret

Decision making is an important part of our day-to-day lives. We humans make many decisions every day, from what to do to what to wear, and each choice affects what comes next. With every decision we also learn what was right and wrong about it. The same idea can be applied to our systems: using machine learning algorithms, we can build our own decision-making and classification models.

A common task for machine learning algorithms is to recognize objects and entities and separate them into categories. This process is called classification. Classification helps us sort large quantities of data into discrete values such as 0 or 1, True or False, or a pre-defined set of output labels. Classification and regression both belong to supervised learning, a type of machine learning in which the model learns by example. Along with the input variables, we also provide the model with the corresponding correct labels. During training, the model looks at which label corresponds to each input and finds patterns between the data and those labels.

Humans classify things every day. In a supermarket, for example, we decide whether vegetables are “green”, “perfect” or “rotten” as we pick them. In machine learning terms, we assign one of these class labels to every vegetable we hold in our hands. The quality of one’s vegetable picking, or as some would call it, a classification model, depends on how accurate those decisions turn out to be. The more often people shop for themselves, the better they become at picking out fresh vegetables. Machine learning models work the same way. Classification can also be viewed as a form of “pattern recognition”: classification algorithms applied to the training data look for recurring patterns, such as similar number sequences, words or sentiments, within the data set.

To judge how good our classification model is, we need evaluation measures. Concepts such as bias and variance, and metrics such as precision and recall, help us estimate how reliable our classifier’s predictions are. Bias is the systematic difference between the model’s predictions and the actual values, while variance describes the model’s sensitivity to fluctuations in the training data. Precision is the fraction of the model’s positive predictions that are actually positive; recall is the fraction of actual positives that the model correctly identifies.
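To make precision and recall concrete, here is a minimal sketch in plain Python. The helper name `precision_recall` and the label lists are made up for illustration; they are not part of PyCaret.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these labels the model makes four positive predictions, three of them correct, and finds three of the four actual positives, so both values come out at 0.75.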

What is PyCaret?

PyCaret is an open-source, low-code machine learning library for Python that aims to shorten the cycle from hypothesis to insight in a machine learning experiment. It lets data scientists run end-to-end experiments quickly, performing complex machine learning tasks with only a few lines of code. It offers a simple, easy-to-use interface in which all operations are automatically stored in a custom PyCaret pipeline that is fully orchestrated towards model development. PyCaret is essentially a wrapper around several other libraries and frameworks, such as scikit-learn, XGBoost, Microsoft LightGBM and spaCy, to name a few.

PyCaret lets you perform both simple and moderately sophisticated analytical tasks that would otherwise require more expertise. It takes you from data preparation to model deployment in very little time, in the notebook environment of your choice. Whether it is imputing missing values, transforming categorical data, engineering features or tuning hyperparameters, PyCaret can help automate it all. Its machine learning capabilities can also be integrated with other environments that support Python, such as Microsoft Power BI, Tableau, Alteryx and KNIME. This gives users of these business intelligence platforms considerable power and flexibility: they can integrate PyCaret into their existing workflows and add a layer of machine learning without much effort.

PyCaret can be used for the following use cases:

  • Data Preparation
  • Training a Model
  • Hyperparameter Tuning within the Model
  • Creating Analysis and Deriving Interpretability
  • Model Selection
  • Experiment Logging

Getting Started with PyCaret 

In this article, we will install the PyCaret library and load a heart disease dataset to solve a binary classification problem: predicting whether or not a person has heart disease. We will then use PyCaret's classification module to build an automated machine learning classification model. The following walkthrough is partially inspired by a PyCaret video tutorial whose link can be accessed from here.

Installing PyCaret 

Our first step is to install the PyCaret library. To do so, you can use the following command.

!pip install pycaret pandas shap

We also install pandas for data analysis and shap, which helps with interpreting our machine learning model's results.

Importing the Dependencies

Next, we import all the dependencies required to create this automated classification model using the following two lines of code.

 import pandas as pd
 from pycaret.classification import *

Loading the Data

We will now import the dataset. Here we use a heart disease dataset, a modified version of the one in the UCI ML repository. It contains several categorical and numerical features and a target column called “target”. We will try to predict a binary outcome, where 1 means the person has been identified as having heart disease and 0 means the person does not have heart disease. The dataset can be downloaded from the link here.

Reading the dataset using pandas, 

 df = pd.read_csv('/content/heart.csv') 

Viewing the columns and first five rows of our loaded dataset,

 df.head()

Viewing the data types present in the dataset,

 df.dtypes

We will get the following results,

 age           int64
 sex           int64
 cp            int64
 trestbps      int64
 chol          int64
 fbs           int64
 restecg       int64
 thalach       int64
 exang         int64
 oldpeak     float64
 slope         int64
 ca            int64
 thal          int64
 target        int64
 dtype: object

Training and Evaluation for the model

PyCaret works on the concept of experiments, so the machine learning run and model being created here will be known as an experiment. Before we set up the experiment, we will define a list of categorical features for the model to understand. 

 cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'thal'] 

We pass the categorical features into our experiment so that PyCaret treats them as categories rather than as numbers. The setup() function initializes the machine learning experiment and sets up the training pipeline. It must be called before any other PyCaret function. Its two mandatory parameters are “data” and “target”, the column we want to predict; numerous other optional parameters can also be set for the experiment.

 #setting the experiment
 experiment = setup(df, target='target', categorical_features=cat_features)  

Once setup() completes, PyCaret displays a summary of the experiment's settings.
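As an aside on why the categorical features are declared explicitly: a column such as 'cp' holds category codes, not quantities, so it is commonly expanded into one 0/1 indicator column per category (one-hot encoding). The sketch below illustrates that idea in plain Python. The `one_hot` helper and the rows are hypothetical, and PyCaret's internal encoding may differ in its details.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Made-up rows; 'cp' is one of the categorical features listed above.
rows = [
    {"age": 63, "cp": 3},
    {"age": 37, "cp": 2},
    {"age": 41, "cp": 1},
]
for row in one_hot(rows, "cp"):
    print(row)
```

After encoding, no category is treated as "larger" than another, which is the practical point of marking these columns as categorical.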

Training the Model 

Using PyCaret, we can train not one but a whole range of machine learning algorithms at once to predict the target. compare_models() displays the algorithms ranked from best to worst for the dataset and returns the best one.

 #show the best model and their statistics
 best_model = compare_models() 

In our case, we can see that the Ridge Classifier performs better than the other models on our dataset.
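The ranking idea can be sketched in plain Python: score each candidate with k-fold cross-validation, then sort by the mean score. Everything below is an illustrative assumption. The two toy "models" and the data are made up, and PyCaret itself evaluates real estimators on several metrics.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def majority_class(train_X, train_y):
    """Baseline: always predict the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label


def threshold_rule(train_X, train_y):
    """Toy model: predict 1 when the single feature exceeds the training mean."""
    cutoff = mean(train_X)
    return lambda x: 1 if x > cutoff else 0


def cv_accuracy(fit, X, y, k=3):
    """Mean accuracy of a model-fitting function across k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores)


# Made-up one-feature data set for illustration only.
X = [1, 8, 2, 9, 3, 7, 2, 8, 1]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]

candidates = {"majority_class": majority_class, "threshold_rule": threshold_rule}
# Sort candidates from best to worst mean cross-validated accuracy.
leaderboard = sorted(
    ((name, cv_accuracy(fit, X, y)) for name, fit in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
best_name, best_score = leaderboard[0]
print(leaderboard)
```

On this toy data the threshold rule tops the leaderboard, in the same way that one algorithm ends up at the top of the comparison table.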

Testing the Model 

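We will now use the best model returned by compare_models(). PyCaret automatically partitions the data into train and test sets; here we pass the last five rows of the dataset to predict_model() to see its predictions.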

 # predict on the last five rows of the dataset
 predict_model(best_model, df.tail()) 

We can now see a new column called “Label”, which holds the results predicted by the model. (In PyCaret 3.x this column is named “prediction_label”.)
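The automatic train/test partitioning mentioned above can be pictured as a shuffled holdout split. The sketch below is a plain-Python illustration, not PyCaret's implementation. The `train_test_split` helper, the seed and the 70/30 ratio are assumptions chosen for the example.

```python
import random


def train_test_split(rows, train_size=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for 10 data set rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on the training part only, so the held-out part gives a fairer estimate of how it behaves on unseen data.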

Saving the created model as a pickle file,

 save_model(best_model, model_name='ridge-model')


 Transformation Pipeline and Model Successfully Saved
                   DataTypes_Auto_infer(categorical_features=['sex', 'cp', 'fbs',
                                                              'restecg', 'exang',
                                        display_types=True, features_todrop=[],
                                        numerical_features=[], target='target',
                  ('fix_perfect', Remove_100(target='target')),
                  ('clean_names', Clean_Colum_Names()),
                  ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                  ('dfs', 'passthrough'), ('pca', 'passthrough'),
                   RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True,
                                   fit_intercept=True, max_iter=None,
                                   normalize=False, random_state=899,
                                   solver='auto', tol=0.001)]],
           verbose=False), 'ridge-model.pkl') 

We can load the saved model again and use it to make predictions,

 model = load_model('ridge-model') 


Transformation Pipeline and Model Successfully Loaded
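The save-and-load round trip follows the usual pickle pattern: serialize the fitted state to a file, read it back later, and predict with the restored copy. The sketch below shows that pattern with Python's standard pickle module on a hypothetical toy model state. It does not use PyCaret, whose own serialization details may differ.

```python
import os
import pickle
import tempfile


def predict(state, values):
    """Toy prediction rule driven entirely by the stored state."""
    return [1 if v > state["cutoff"] else 0 for v in values]


# Hypothetical "trained" state standing in for a fitted model.
model_state = {"name": "toy-threshold", "cutoff": 5}
samples = [2, 9, 4, 7]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-model.pkl")
    # Serialize the state to disk...
    with open(path, "wb") as f:
        pickle.dump(model_state, f)
    # ...and read it back as a new object with the same contents.
    with open(path, "rb") as f:
        restored_state = pickle.load(f)

same_predictions = predict(restored_state, samples) == predict(model_state, samples)
print(same_predictions)
```

Because the restored object carries the same state, it produces the same predictions as the original.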

Verifying our predictions again,


Output :

 array([0, 1, 0, 0, 1]) 

So, as we can see, the loaded model gives us the same array of predicted results as before!

What More?

You can also create visualizations for any trained model, such as ROC curves, feature importance plots, confusion matrices and much more.

 # train a logistic regression model
 lr = create_model('lr')
 # plot the confusion matrix of the trained model
 plot_model(lr, plot='confusion_matrix')
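For readers unfamiliar with what a confusion matrix plot shows, the underlying table simply counts how often each actual label was predicted as each label. Here is a minimal plain-Python sketch; the `confusion_matrix` helper and the labels are made up for illustration and are not PyCaret's implementation.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels] for actual in labels]


# Made-up labels for illustration only.
y_true = [0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(y_true, y_pred)
for actual, row in zip((0, 1), matrix):
    print(f"actual={actual}: {row}")
```

The diagonal entries are the correct predictions, and the off-diagonal entries are the two kinds of mistakes the classifier makes.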


This article introduced the PyCaret library and showed how it lets us create machine learning models with very few lines of code. We also got a hands-on view of what it takes to build a classification model using PyCaret and save it for future use. You can try these and other operations on more complex datasets to understand the power of PyCaret. The Colab notebook for the above implementation can be found here.

Happy Learning!

