Guide To PyOD: A Python Toolkit For Outlier Detection

PyOD is a flexible and scalable toolkit for detecting outliers or anomalies in multivariate data; hence the name PyOD (Python Outlier Detection). It was introduced by Yue Zhao, Zain Nasrullah and Zeng Li in a May 2019 paper in the Journal of Machine Learning Research (JMLR).

Before going into the details of PyOD, let us understand in brief what outlier detection means.

What is outlier detection?

Outliers are data points that differ significantly from the majority of observations or do not conform to the trend or pattern followed by the rest of the data. The process of identifying such suspicious data points is known as outlier detection. Detecting fraudulent transactions in the banking sector is a typical example of outlier detection.
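
As a minimal illustration of the idea (this toy snippet is not part of PyOD, and the numbers are invented for the example), one simple way to flag an outlier in a one-dimensional sample is to mark points that lie more than two standard deviations from the mean:

 import numpy as np

 # toy 1-D sample with one value that clearly deviates from the rest
 data = np.array([10.2, 9.8, 10.1, 9.9, 10.3, 10.0, 25.0])

 # z-score: distance from the mean measured in standard deviations
 z_scores = (data - data.mean()) / data.std()

 # flag points lying more than two standard deviations away as outliers
 outliers = data[np.abs(z_scores) > 2]
 print(outliers)  # only the value 25.0 is flagged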

Overview of PyOD

PyOD is an open-source Python toolbox that provides more than 20 outlier detection algorithms to date, ranging from classical techniques such as the local outlier factor (LOF) to neural-network-based approaches such as autoencoders and adversarial models. The complete list of supported algorithms is available in the official documentation.
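
All PyOD detectors expose a scikit-learn-style interface, so switching between algorithms usually means changing only the import and the constructor. Below is a rough sketch of that common workflow (the toy data is made up for illustration; any other PyOD detector could replace KNN):

 import numpy as np
 from pyod.models.knn import KNN  # any PyOD detector follows the same interface

 # toy 2-D data: a tight Gaussian cluster plus a few far-away points
 rng = np.random.RandomState(0)
 X = np.vstack([rng.randn(95, 2), rng.uniform(low=-6, high=6, size=(5, 2))])

 detector = KNN(contamination=0.05)  # expected proportion of outliers
 detector.fit(X)

 labels = detector.labels_            # binary labels: 0 = inlier, 1 = outlier
 scores = detector.decision_scores_   # raw outlier scores of the training data
 print(labels.sum(), 'points flagged as outliers')

 # scoring new, unseen samples
 X_new = rng.randn(10, 2)
 print(detector.predict(X_new))            # binary predictions
 print(detector.decision_function(X_new))  # outlier scores for the new samples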

Key features of the PyOD toolkit

  • It is compatible with both Python 2 and Python 3 across Linux, MacOS and Windows operating systems. The compatibility is achieved using the six library.
  • It can combine the outputs of multiple detectors, through detector combination and ensemble methods, into a single result (a short sketch follows this list).
  • It includes an easy-to-use, unified API and interactive examples for the supported algorithms.
  • Optimization techniques such as parallelization and Just-In-Time (JIT) compilation are employed for selected models where required.
  • The codebase follows software engineering practices such as unit testing, code coverage measurement, continuous integration and code maintainability checks.
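
To give a flavour of the score-combination capability mentioned above, here is a hedged sketch (the toy data and the choice of detectors are invented for this example) that averages and maximizes the standardized scores of several KNN detectors using PyOD's combination helpers:

 import numpy as np
 from pyod.models.knn import KNN
 from pyod.models.combination import average, maximization
 from pyod.utils.utility import standardizer

 # toy 2-D data: a Gaussian blob plus a few scattered points
 rng = np.random.RandomState(42)
 X = np.vstack([rng.randn(180, 2), rng.uniform(-6, 6, size=(20, 2))])

 # train several KNN detectors with different neighbourhood sizes
 detectors = [KNN(n_neighbors=k) for k in (5, 10, 20, 30)]
 scores = np.zeros((X.shape[0], len(detectors)))
 for j, det in enumerate(detectors):
     det.fit(X)
     scores[:, j] = det.decision_scores_

 # z-score normalize the per-detector scores, then combine them
 scores_norm = standardizer(scores)
 combined_avg = average(scores_norm)       # average of the detector scores
 combined_max = maximization(scores_norm)  # maximum of the detector scores
 print(combined_avg[:5])
 print(combined_max[:5])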

Essential dependencies

PyOD builds on the standard scientific Python stack; at the version used here, its core dependencies include numpy, scipy, scikit-learn, numba, joblib, matplotlib and six.

Practical implementation

Here’s a demonstration of applying eight different outlier detection algorithms with the PyOD library and comparing their results visually. The code shown here was tested on Google Colab with Python 3.7.10 and PyOD 0.8.7.

The following models are included in the demonstration:

  • Angle-Based Outlier Detector (ABOD)
  • Cluster-based Local Outlier Factor (CBLOF)
  • Isolation Forest
  • k-Nearest Neighbors (KNN)
  • Average KNN 
  • Local Outlier Factor (LOF)
  • One-Class SVM (OCSVM)
  • Principal Component Analysis (PCA)

Step-wise explanation of the code is as follows:

  1. Install the PyOD and combo toolboxes
 !pip install --upgrade pyod
 !pip install combo
  2. Import the required libraries
 from __future__ import division
 from __future__ import print_function
 import os
 import sys
 from time import time
 import numpy as np
 from numpy import percentile
 import matplotlib.pyplot as plt
 import matplotlib.font_manager 
  3. Import all the models to be used from the pyod.models module
 from pyod.models.abod import ABOD
 from pyod.models.cblof import CBLOF
 from pyod.models.iforest import IForest
 from pyod.models.knn import KNN
 from pyod.models.lof import LOF
 from pyod.models.ocsvm import OCSVM
 from pyod.models.pca import PCA 
  4. Define the total number of samples and the fraction of outliers
 num_samples = 500
 out_frac = 0.30
  5. Initialize the inliers and outliers data
 clusters_separation = [0]
 x, y = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
 # (1 - fraction of outliers) gives the fraction of inliers; multiplying it
 # by the total number of samples gives the number of inliers
 num_inliers = int((1. - out_frac) * num_samples)
 # multiply the fraction of outliers by the total number of samples to get
 # the number of outliers
 num_outliers = int(out_frac * num_samples)
 # ground truth array: 0 marks inliers and 1 marks outliers
 ground_truth = np.zeros(num_samples, dtype=int)
 ground_truth[-num_outliers:] = 1
  6. Display the number of inliers and outliers and the ground truth array
 print('No. of inliers: %i' % num_inliers)
 print('No. of outliers: %i' % num_outliers)
 print('Ground truth array shape is {shape}. Outliers are 1 and inliers are 0.\n'
       .format(shape=ground_truth.shape))
 print(ground_truth)

Output:

  7. Define a dictionary of the outlier detection methods to be compared
 rs = np.random.RandomState(42)  #random state
 #dictionary of classifiers
 clf = {    
     'Angle-based Outlier Detector (ABOD)':
         ABOD(contamination=out_frac),
     'Cluster-based Local Outlier Factor (CBLOF)':
         CBLOF(contamination=out_frac,
          check_estimator=False, random_state=rs),
     'Isolation Forest': IForest(contamination=out_frac,
                                 random_state=rs),
     'K Nearest Neighbors (KNN)': KNN(
         contamination=out_frac),
     'Average KNN': KNN(method='mean',
                        contamination=out_frac),
     'Local Outlier Factor (LOF)':
         LOF(n_neighbors=35, contamination=out_frac),
     'One-class SVM (OCSVM)': OCSVM(contamination=out_frac),
     'Principal Component Analysis (PCA)': PCA(
         contamination=out_frac, random_state=rs),
 } 
  8. Display the names of the classifiers used
 for i, classifier in enumerate(clf.keys()):
     print('Model', i + 1, classifier) 

Output:

 Model 1 Angle-based Outlier Detector (ABOD)
 Model 2 Cluster-based Local Outlier Factor (CBLOF)
 Model 3 Isolation Forest
 Model 4 K Nearest Neighbors (KNN)
 Model 5 Average KNN
 Model 6 Local Outlier Factor (LOF)
 Model 7 One-class SVM (OCSVM)
 Model 8 Principal Component Analysis (PCA) 
  9. Fit the models to the data and visualize their results
 for i, offset in enumerate(clusters_separation):
     np.random.seed(42)
     # Data generation: two Gaussian clusters of inliers
     X1 = 0.3 * np.random.randn(num_inliers // 2, 2) - offset  # first inlier cluster
     X2 = 0.3 * np.random.randn(num_inliers // 2, 2) + offset  # second inlier cluster
     # stack X1 and X2 row-wise using numpy.r_
     X = np.r_[X1, X2]
     # add outliers drawn uniformly from [-6, 6] to the X array
     X = np.r_[X, np.random.uniform(low=-6, high=6, size=(num_outliers, 2))]

     # Fit the models one by one
     plt.figure(figsize=(15, 12))
     # for each classifier to be tested
     for j, (classifier_name, classifier) in enumerate(clf.items()):
         # fit the classifier to the data X
         classifier.fit(X)
         # outlier scores of the training data (negated so that lower = more outlying)
         scores_pred = classifier.decision_function(X) * -1
         # binary predictions of the classifier (0 = inlier, 1 = outlier)
         y_pred = classifier.predict(X)
         # score value below which out_frac of the points fall, used as the decision threshold
         threshold = percentile(scores_pred, 100 * out_frac)
         # number of errors: points whose prediction differs from the ground truth
         num_errors = (y_pred != ground_truth).sum()
         # plot the level lines and the points
         Z = classifier.decision_function(np.c_[x.ravel(), y.ravel()]) * -1
         Z = Z.reshape(x.shape)
         # 2 rows of 4 subplots each
         subplot = plt.subplot(2, 4, j + 1)
         # filled contours below the threshold (outlier region)
         subplot.contourf(x, y, Z, levels=np.linspace(Z.min(), threshold, 7),
                          cmap=plt.cm.Blues_r)
         # red contour line at the learned decision boundary
         a = subplot.contour(x, y, Z, levels=[threshold],
                             linewidths=2, colors='red')
         # filled contours above the threshold (inlier region)
         subplot.contourf(x, y, Z, levels=[threshold, Z.max()],
                          colors='orange')
         # true inliers
         b = subplot.scatter(X[:-num_outliers, 0], X[:-num_outliers, 1],
                             c='white', s=20, edgecolor='k')
         # true outliers
         c = subplot.scatter(X[-num_outliers:, 0], X[-num_outliers:, 1],
                             c='black', s=20, edgecolor='k')
         subplot.axis('tight')
         # legend of the subplot
         subplot.legend(
             [a.collections[0], b, c],
             ['learned decision function', 'true inliers', 'true outliers'],
             prop=matplotlib.font_manager.FontProperties(size=10),
             loc='lower right')
         # x-axis label: model number, name and error count
         subplot.set_xlabel("%d. %s (errors: %d)" % (j + 1, classifier_name,
                                                     num_errors))
         # limits of both axes
         subplot.set_xlim((-7, 7))
         subplot.set_ylim((-7, 7))
     # layout parameters
     plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
     # centered title for the figure
     plt.suptitle("Outlier detection by 8 models")
 plt.show()  # display the plots

Output:
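
The models above are compared by the number of misclassified points. If a ranking-based comparison is also wanted, a short sketch like the following (reusing the X, ground_truth and clf objects defined above; scikit-learn is assumed to be available) computes the ROC AUC of each detector's outlier scores:

 from sklearn.metrics import roc_auc_score

 # reuse X, ground_truth and the fitted classifiers from the code above
 for name, classifier in clf.items():
     scores = classifier.decision_function(X)  # higher score = more outlying
     print('%s ROC AUC: %.3f' % (name, roc_auc_score(ground_truth, scores)))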

References

For an in-depth understanding of the PyOD toolkit and its tutorials, refer to the following sources:

  • Zhao, Y., Nasrullah, Z. and Li, Z. (2019). PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research, 20(96), 1-7.
  • Official PyOD documentation and examples: https://pyod.readthedocs.io/
