Step-by-Step Building Block For Machine Learning Models


Machine learning is a process in which a machine learns hidden patterns from data and uses them to make predictions. It is a subset and an application of Artificial Intelligence. Machine learning has many real-life use cases today. In the banking sector, for example, lenders use machine learning models to predict whether a loan applicant is likely to default, and the websites that generate your credit score also rely on machine learning. Two of the main types of tasks in machine learning are classification and regression. Classification is a task where predictive models are trained to assign data to different classes, such as identifying different fruits from images passed to the model. Regression is a task where models are built to predict continuous variables, such as the next day's temperature.

In this article, we will focus on classification and see how to build a classification model by following the required steps one at a time. We will use the Iris data set, which is publicly available for download from the UCI Machine Learning Repository. The data set contains the length and width of the sepals and petals of each flower, along with its species. We will build a machine learning model that predicts which species a flower belongs to when we pass these measurements to the model.

What Will You Learn From This Article? 

  1. Importing data from csv files
  2. Exploratory data analysis
  3. Data visualisation
  4. Splitting data into training and testing
  5. Building machine learning models
  6. Predictions by the models
  7. Model evaluation

  1. Importing the data from csv files

The pandas package provides functions that are widely used for importing data sets in different formats, such as read_csv for csv files and read_excel for xlsx files. We will use read_csv here. Use the below code to import the data set and print the first 10 rows of the data. We first import the pandas package and then read the data.


import pandas as pd

df = pd.read_csv('iris.csv')

df.head(10)



  2. Exploratory Data Analysis

Exploratory data analysis (EDA) is a process where we explore the data set to get familiar with it. We find out the shape of the data, the missing values, the data types, and so on. All the tasks that are done on the data before building a machine learning model come under EDA. We will now explore the data set we just imported, checking its shape, the data type of each column, and the number of missing values.
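The basic checks described above can be sketched as follows. This is a minimal sketch rather than the article's original listing: so that it runs without iris.csv, it assumes the copy of the Iris data bundled with scikit-learn, and the column names it assigns are assumptions as well.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data in place of iris.csv
iris = load_iris()
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # assumed names
df = pd.DataFrame(iris.data, columns=columns)

# Map the integer targets back to species names so the column holds text labels
df["species"] = [iris.target_names[i] for i in iris.target]

print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing values per column
```

With the CSV file from the article, the same three checks would be run on the DataFrame returned by read_csv.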






We found that the data has 150 rows and 5 columns. The species column is an object type column, all the others hold float values, and there are no missing values. We now convert the categorical species column to integer labels using LabelEncoder. Use the code shown below to do the conversion.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['species'] = le.fit_transform(df['species'])
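To see what the encoder does to a text column, here is a small self-contained sketch. The species labels in it are hypothetical sample values chosen for illustration, not rows read from the data set.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels, for illustration only
species = ["setosa", "virginica", "versicolor", "setosa"]

le = LabelEncoder()
encoded = le.fit_transform(species)  # learn the classes and encode in one step

print(list(le.classes_))                    # classes are stored in sorted order
print(list(encoded))                        # integer code for each input label
print(list(le.inverse_transform(encoded)))  # codes mapped back to the names
```

Keeping the fitted encoder around is useful later, since inverse_transform turns a model's integer predictions back into species names.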



We will now check the descriptive statistics of the data and the correlation between its columns. pandas provides the describe and corr methods of a DataFrame for this.
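A minimal sketch of these two checks is shown here. As an assumption, it loads scikit-learn's bundled copy of the Iris data instead of iris.csv, uses assumed column names, and takes the species column as already integer-encoded.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris()
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # assumed names
df = pd.DataFrame(iris.data, columns=columns)
df["species"] = iris.target  # integer class labels, as after label encoding

stats = df.describe()  # count, mean, std, min, quartiles, max per column
corr = df.corr()       # pairwise Pearson correlation between columns

print(stats)
print(corr)
```

In this data the petal length and petal width are strongly correlated, which is one reason the petal measurements separate the species so well.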





  3. Data Visualisation

Data visualisation is the graphical representation of data. It is used to check for outliers, patterns, the distribution of the data, and so on. Python has several data visualisation libraries, including matplotlib and seaborn. We will use the seaborn library to draw a pairplot. Use the below code to do so. We first import the seaborn library and then draw the pairplot.

import seaborn as sns

sns.pairplot(df, hue='species')


  4. Splitting data into training and testing

Before building a machine learning model, the data is split into two parts, called the training set and the testing set. We expose only the training data to the model during training and never the testing data. Once the model is trained, we use it to compute predictions over the testing data, which are commonly stored in a variable named y_pred, although any name can be used. We will first define the independent variables and the dependent variable as X and y, respectively, and then split the data. Use the below code to do the same.




X = df.drop('species', axis=1)

y = df['species']




from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)







We have split the data and checked the shape of training as well as testing data.
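The shape check mentioned above can be sketched like this. As an assumption, the sketch takes X and y from scikit-learn's bundled copy of the Iris data rather than from iris.csv; the split parameters are the ones used in this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)

# Same split settings as in the article: 30% for testing, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(X_train.shape, X_test.shape)  # feature rows and columns per part
print(y_train.shape, y_test.shape)  # label counts per part
```

With 150 rows and test_size=0.30, 45 rows go to the testing set and the remaining 105 to the training set.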

  5. Building Machine Learning Models

We will now build machine learning models using two different algorithms, Logistic Regression and Random Forest. Logistic regression is a linear model, whereas random forest is an ensemble method. We will first import both and then fit each model on the training data. Once the models are trained, we will compute predictions over the testing data and store them in separate variables. Use the below code to do the same.

Model 1

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

Model 2

from sklearn.ensemble import RandomForestClassifier

rfcl = RandomForestClassifier()

rfcl.fit(X_train, y_train)

y_pred_rf = rfcl.predict(X_test)

  6. Prediction by the Models

We will now compute predictions for a few rows and check whether the models predict correctly. We will predict rows 10 to 15 with model 1 and rows 15 to 20 with model 2, and then compare the predictions with the actual classes. Note that these rows are taken from the training data, so this is a quick sanity check rather than an evaluation.

print("Prediction by model 1: ", lr.predict(X_train.iloc[10:15]))

print("\nActual Labels: \n", y_train.iloc[10:15])


print("Prediction by model 2: ", rfcl.predict(X_train.iloc[15:20]))

print("\nActual Labels: \n",y_train.iloc[15:20])



We can see that both models gave the correct predictions for the rows we checked.

  7. Model Evaluation

Model evaluation is a technique where we check the performance of a model by computing different metrics. There are many such metrics: accuracy and the confusion matrix are used for classification tasks, while mean squared error and mean absolute error are used for regression tasks. We have built our models for classification, so we will compute the metrics that are used to evaluate a classification model.

We will first compute the accuracy score followed by the confusion matrix and classification report. Use the below code to compute the same. 

Accuracy Score

from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

print("Logistic Regression: ", accuracy_score(y_test, y_pred_lr))

print("Random Forest: ", accuracy_score(y_test, y_pred_rf))


Confusion Matrix

print("Logistic Regression: \n", confusion_matrix(y_test, y_pred_lr))

print("\nRandom Forest: \n", confusion_matrix(y_test, y_pred_rf))
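scikit-learn's metric functions take the true labels as the first argument and the predictions as the second. The short sketch here shows why the order matters for a confusion matrix. The labels in it are hypothetical values made up for illustration, not output from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Documented order: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_pred, y_true)

print(cm)
print(cm_swapped)
```

Accuracy is unaffected by the argument order, but the rows and columns of the confusion matrix trade places, so a swapped call is easy to misread.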



Classification Report

print("Logistic Regression: \n", classification_report(y_test, y_pred_lr))

print("\nRandom Forest: \n", classification_report(y_test, y_pred_rf))
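The precision and recall columns of a classification report can be derived from the confusion matrix. The sketch here works them out by hand and compares them with scikit-learn's own functions. The labels are hypothetical values made up for illustration, not output from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted

# Precision: correct predictions of a class / all predictions of that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class / all actual members of that class
recall = np.diag(cm) / cm.sum(axis=1)

print(precision, precision_score(y_true, y_pred, average=None))
print(recall, recall_score(y_true, y_pred, average=None))
```

Because precision divides by column sums and recall by row sums, passing the arguments in the wrong order swaps the two columns of the report.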




I hope you now understand every step required to build a machine learning model. We built classification models to classify the species of a flower and then evaluated them using different metrics. You can now check the article “Hands-on-Guide to machine learning model deployment using Flask” to learn how to deploy these models and check their performance in real time. Also check the article titled “Model Evaluation and Error Metrics” to learn more about error metrics for model evaluation.

Rohit Dwivedi
I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. Data Science Enthusiast who likes to draw insights from the data. Always amazed with the intelligence of AI. It's really fascinating teaching a machine to see and understand images. Also, the interest gets doubled when the machine can tell you what it just saw. This is where I say I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community with my writings.
