MITB Banner

A hands-on guide to linear discriminant analysis for binary classification

The Linear Discriminant Analysis (LDA) is a method to separate the data points by learning relationships between the high dimensional data points and the learner line. It reduces the high dimensional data to linear dimensional data.
Share
Listen to this story

The Linear Discriminant Analysis (LDA) is a method to separate the data points by learning relationships between the high dimensional data points and the learner line. It reduces the high dimensional data to linear dimensional data. LDA is also used as a tool for classification, dimension reduction, and data visualization. The LDA method often produces robust, decent, and interpretable classification results despite its simplicity. In this article, we will have an interdiction to LDA specifically focusing on its classification capabilities. Below is a list of points that we will cover in this article.

Table of contents

  1. Why was Linear Discriminant Analysis introduced?
  2. About Linear Discriminant Analysis (LDA)
  3. LDA for Binary Classification Python

Let’s understand why Linear Discriminant Analysis (LDA) was introduced for classification.

Why was LDA introduced?

To understand the reason for the introduction of LDA first we need to understand what went wrong with Logistic Regression. There are the following limitations faced by the logistic regression technique:

  • Logistic Regression is traditionally used for binary classification problems. Though it can be extrapolated and used in multi-class classification, this is rarely performed. 
  • Logistic Regression can lack stability when the classes are well-separated. So, there is instability when classes are well separated.
  • Logistic Regression is unstable when there are fewer features parameters to be estimated.

All the above reasons for the failure of logistic regression have been rectified in LDA. Moving further in the article let’s have a brief introduction to our benchmark classifier the Linear Discriminant Analysis (LDA).

Are you looking for a complete repository of Python libraries used in data science, check out here.

About Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a technique for classifying binary and non-binary features using and linear algorithm for learning the relationship between the dependent and independent features. It uses the Fischer formula to reduce the dimensionality of the data so as to fit in a linear dimension. LDA is a multi-functional algorithm, it is a classifier, dimensionality reducer and data visualizer. The aim of LDA is:

  • To minimize the inter-class variability which refers to classifying as many similar points as possible in one class. This ensures fewer misclassifications.
  • To maximize the distance between the mean of classes, the mean is placed as far as possible to ensure high confidence during prediction.

Image link

In the above representation, there are two classifications created with data having a fixed covariance. It is explained with the help of a linear relationship line. 

Let’s implement LDA on the data and observe the outcomes.

LDA for Binary Classification Python

The Linear Discriminant Analysis package is present in the sklearn library. Let’s start with importing some libraries that are going to be used for this article.

Importing library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay,precision_score,recall_score,confusion_matrix
import warnings
warnings.filterwarnings('ignore')

Reading the data

df=pd.read_csv("/content/drive/MyDrive/Datasets/heart.csv")
print("no of rows=",df.shape[0],"\nno of columns=",df.shape[1])
df.head()

There are a total of 14 features including the dependent variable “output”. This data is related to the prediction of chances of having a heart attack.

Preparing data for training

X=df.drop('output',axis=1)
y=df['output']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=42)

Split the data in train and test with a ratio of 70:30 respectively. The shape of the train and test dependent and independent variables data is shown above.

Fitting LDA with data

LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, y_train)
LDA_pred=LDA.predict(X_test)

The model is now trained on the training data and it has predicted values using the test data. Let’s see the performance factors for the precision of prediction.

Creating Confusion Matrix

tn, fp, fn, tp = confusion_matrix(list(y_test), list(LDA_pred), labels=[0, 1]).ravel()
 
print('True Positive', tp)
print('True Negative', tn)
print('False Positive', fp)
print('False Negative', fn)
 
ConfusionMatrixDisplay.from_predictions(y_test, LDA_pred)
plt.show()

With help of the above confusion matrix, we can calculate the precision and recall score. We will be using the precision score for this problem because it is the positive predicted value so in this case, the positive is how many patients are suffering from a heart attack.

Calculating precision score

print("Precision score",precision_score(y_test,LDA_pred))
Precision score 0.82

Our model has a precision of 82% in predicting the patient is suffering from a heart attack. This score is not good because in medical-related problems the model needs to be 99% confident about the decision. So, for improving the model performance the goal is to reduce the False-negative and False positive which is also known as type 1 and type 2 error respectively. In this case, we should focus more on False-negative reduction because they are those patients who are or will be suffering from a heart attack but the model has classified them as negative (non-heart attack patients).

Final verdict

Linear Discriminant Analysis uses distance to the class mean which is easier to interpret, uses linear decision boundary for explaining the classification and it reduces the dimensionality. With a hands-on implementation of this concept in this article, we could understand how Linear Discriminant Analysis is used in classification.

References

PS: The story was written using a keyboard.
Picture of Sourabh Mehta

Sourabh Mehta

Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
Related Posts

Download our Mobile App

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
Recent Stories

Featured

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

AIM Conference Calendar

Immerse yourself in AI and business conferences tailored to your role, designed to elevate your performance and empower you to accomplish your organization’s vital objectives. Revel in intimate events that encapsulate the heart and soul of the AI Industry.

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed