Active Hackathon

A hands-on guide to linear discriminant analysis for binary classification

The Linear Discriminant Analysis (LDA) is a method to separate the data points by learning relationships between the high dimensional data points and the learner line. It reduces the high dimensional data to linear dimensional data.
Listen to this story

The Linear Discriminant Analysis (LDA) is a method to separate the data points by learning relationships between the high dimensional data points and the learner line. It reduces the high dimensional data to linear dimensional data. LDA is also used as a tool for classification, dimension reduction, and data visualization. The LDA method often produces robust, decent, and interpretable classification results despite its simplicity. In this article, we will have an interdiction to LDA specifically focusing on its classification capabilities. Below is a list of points that we will cover in this article.

Table of contents

  1. Why was Linear Discriminant Analysis introduced?
  2. About Linear Discriminant Analysis (LDA)
  3. LDA for Binary Classification Python

Let’s understand why Linear Discriminant Analysis (LDA) was introduced for classification.

THE BELAMY

Sign up for your weekly dose of what's up in emerging technology.

Why was LDA introduced?

To understand the reason for the introduction of LDA first we need to understand what went wrong with Logistic Regression. There are the following limitations faced by the logistic regression technique:

  • Logistic Regression is traditionally used for binary classification problems. Though it can be extrapolated and used in multi-class classification, this is rarely performed. 
  • Logistic Regression can lack stability when the classes are well-separated. So, there is instability when classes are well separated.
  • Logistic Regression is unstable when there are fewer features parameters to be estimated.

All the above reasons for the failure of logistic regression have been rectified in LDA. Moving further in the article let’s have a brief introduction to our benchmark classifier the Linear Discriminant Analysis (LDA).

Are you looking for a complete repository of Python libraries used in data science, check out here.

About Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a technique for classifying binary and non-binary features using and linear algorithm for learning the relationship between the dependent and independent features. It uses the Fischer formula to reduce the dimensionality of the data so as to fit in a linear dimension. LDA is a multi-functional algorithm, it is a classifier, dimensionality reducer and data visualizer. The aim of LDA is:

  • To minimize the inter-class variability which refers to classifying as many similar points as possible in one class. This ensures fewer misclassifications.
  • To maximize the distance between the mean of classes, the mean is placed as far as possible to ensure high confidence during prediction.

Image link

In the above representation, there are two classifications created with data having a fixed covariance. It is explained with the help of a linear relationship line. 

Let’s implement LDA on the data and observe the outcomes.

LDA for Binary Classification Python

The Linear Discriminant Analysis package is present in the sklearn library. Let’s start with importing some libraries that are going to be used for this article.

Importing library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay,precision_score,recall_score,confusion_matrix
import warnings
warnings.filterwarnings('ignore')

Reading the data

df=pd.read_csv("/content/drive/MyDrive/Datasets/heart.csv")
print("no of rows=",df.shape[0],"\nno of columns=",df.shape[1])
df.head()

There are a total of 14 features including the dependent variable “output”. This data is related to the prediction of chances of having a heart attack.

Preparing data for training

X=df.drop('output',axis=1)
y=df['output']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=42)

Split the data in train and test with a ratio of 70:30 respectively. The shape of the train and test dependent and independent variables data is shown above.

Fitting LDA with data

LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, y_train)
LDA_pred=LDA.predict(X_test)

The model is now trained on the training data and it has predicted values using the test data. Let’s see the performance factors for the precision of prediction.

Creating Confusion Matrix

tn, fp, fn, tp = confusion_matrix(list(y_test), list(LDA_pred), labels=[0, 1]).ravel()
 
print('True Positive', tp)
print('True Negative', tn)
print('False Positive', fp)
print('False Negative', fn)
 
ConfusionMatrixDisplay.from_predictions(y_test, LDA_pred)
plt.show()

With help of the above confusion matrix, we can calculate the precision and recall score. We will be using the precision score for this problem because it is the positive predicted value so in this case, the positive is how many patients are suffering from a heart attack.

Calculating precision score

print("Precision score",precision_score(y_test,LDA_pred))
Precision score 0.82

Our model has a precision of 82% in predicting the patient is suffering from a heart attack. This score is not good because in medical-related problems the model needs to be 99% confident about the decision. So, for improving the model performance the goal is to reduce the False-negative and False positive which is also known as type 1 and type 2 error respectively. In this case, we should focus more on False-negative reduction because they are those patients who are or will be suffering from a heart attack but the model has classified them as negative (non-heart attack patients).

Final verdict

Linear Discriminant Analysis uses distance to the class mean which is easier to interpret, uses linear decision boundary for explaining the classification and it reduces the dimensionality. With a hands-on implementation of this concept in this article, we could understand how Linear Discriminant Analysis is used in classification.

References

More Great AIM Stories

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.

Our Upcoming Events

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan, 2023

Conference, in-person (Bangalore)
Rising 2023 | Women in Tech Conference
16-17th Mar, 2023

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
27-28th Apr, 2023

Conference, in-person (Bangalore)
MachineCon 2023
23rd Jun, 2023

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM