Guide To PyOD: A Python Toolkit For Outlier Detection

PyOD is a flexible and scalable toolkit designed for detecting outliers or anomalies in multivariate data; hence the name PyOD (Python Outlier Detection). It was introduced by Yue Zhao, Zain Nasrullah and Zeng Li in May 2019 in a paper published in the Journal of Machine Learning Research (JMLR).

Before going into the details of PyOD, let us briefly understand what outlier detection means.

What is outlier detection?

Outliers in data analysis are data points that differ significantly from the majority of observations or do not conform to the trend or pattern the rest follow. The process of identifying such suspicious data points is known as outlier detection. Detecting fraudulent transactions in the banking sector is a typical example of outlier detection.

Overview of PyOD

PyOD is an open-source Python toolbox that provides more than 20 outlier detection algorithms to date – ranging from classical techniques such as the local outlier factor (LOF) to novel neural network architectures such as autoencoders and adversarial models. The complete list of supported algorithms is available in the official PyOD documentation.

Key features of the PyOD toolkit

  • It is compatible with both Python 2 and Python 3 across Linux, macOS and Windows operating systems; cross-version compatibility is achieved using the six library.
  • PyOD can produce combined, cumulative results from multiple detection methods via outlier ensembles and model-combination frameworks (a score-combination sketch appears at the end of the walkthrough below).
  • It includes an easy-to-use, unified API and interactive examples for the supported algorithms (see the minimal sketch after this list).
  • Optimization techniques such as parallelization and just-in-time (JIT) compilation are employed for selected models where required.
  • The supported models follow software engineering practices such as unit testing, code coverage, continuous integration and code maintainability checks.
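
As a quick illustration of that unified API, here is a minimal sketch (written for this guide; the toy data and variable names are illustrative assumptions, not from the PyOD docs) using the KNN detector:

 import numpy as np
 from pyod.models.knn import KNN

 rng = np.random.RandomState(42)
 # a dense inlier cluster plus a few uniformly scattered outliers
 X_train = np.r_[0.3 * rng.randn(100, 2),
                 rng.uniform(-6, 6, size=(10, 2))]

 detector = KNN(contamination=0.1)  # expected fraction of outliers
 detector.fit(X_train)

 print(detector.labels_[:5])           # binary labels on the training data (1 = outlier)
 print(detector.decision_scores_[:5])  # raw outlier scores on the training data

 X_new = rng.uniform(-6, 6, size=(3, 2))
 print(detector.predict(X_new))            # 0/1 labels for unseen points
 print(detector.decision_function(X_new))  # outlier scores for unseen points

The same fit / predict / decision_function pattern applies to every detector used in the demonstration below.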

Essential dependencies
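
As of the PyOD 0.8.x releases used below, the documented core dependencies include numpy, scipy, scikit-learn, numba, joblib and six; matplotlib is needed for the visualization examples, and the combo toolbox for the model-combination features. The official documentation lists the exact version requirements.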

Practical implementation

Here’s a demonstration of applying eight different outlier detection algorithms from the PyOD library and comparing their results visually. The code demonstrated here was tested on Google Colab with Python 3.7.10 and PyOD 0.8.7.

The following models are included in the demonstration:

  • Angle-Based Outlier Detector (ABOD)
  • Cluster-based Local Outlier Factor (CBLOF)
  • Isolation Forest
  • k-Nearest Neighbors (KNN)
  • Average KNN 
  • Local Outlier Factor (LOF)
  • One-Class SVM (OCSVM)
  • Principal Component Analysis (PCA)

Step-wise explanation of the code is as follows:

  1. Install the PyOD and combo toolboxes
 !pip install --upgrade pyod
 !pip install combo
  2. Import the required libraries
 from __future__ import division
 from __future__ import print_function
 import os
 import sys
 from time import time
 import numpy as np
 from numpy import percentile
 import matplotlib.pyplot as plt
 import matplotlib.font_manager 
  3. Import the models to be used from the pyod.models module
 from pyod.models.abod import ABOD
 from pyod.models.cblof import CBLOF
 from pyod.models.iforest import IForest
 from pyod.models.knn import KNN
 from pyod.models.lof import LOF
 from pyod.models.ocsvm import OCSVM
 from pyod.models.pca import PCA 
  4. Define the total number of samples and the fraction of outliers.
 num_samples = 500
 out_frac = 0.30
  5. Initialize the inlier and outlier data
 clusters_separation = [0]
 # grid of points used later for plotting each model's decision function
 x, y = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
 # (1 - fraction of outliers) gives the fraction of inliers; multiplying it
 # by the total number of samples gives the number of inliers
 num_inliers = int((1. - out_frac) * num_samples)
 # multiply the fraction of outliers by the total number of samples
 # to compute the number of outliers
 num_outliers = int(out_frac * num_samples)
 # create the ground truth array: 0 marks inliers, 1 marks outliers
 ground_truth = np.zeros(num_samples, dtype=int)
 ground_truth[-num_outliers:] = 1
  6. Display the number of inliers and outliers and the ground truth array.
 print('No. of inliers: %i' % num_inliers)
 print('No. of outliers: %i' % num_outliers)
 print('Ground truth array shape is {shape}. Outliers are 1 and inliers are 0.\n'.format(shape=ground_truth.shape))
 print(ground_truth)

Output:
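
With num_samples = 500 and out_frac = 0.30 as set above, the printed counts are deterministic:

 No. of inliers: 350
 No. of outliers: 150
 Ground truth array shape is (500,). Outliers are 1 and inliers are 0.

followed by the 500-element ground truth array itself (350 zeros, then 150 ones).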

  7. Define a dictionary of the outlier detection methods to be compared
 rs = np.random.RandomState(42)  # random state for reproducibility
 # dictionary of classifiers
 clf = {
     'Angle-based Outlier Detector (ABOD)':
         ABOD(contamination=out_frac),
     'Cluster-based Local Outlier Factor (CBLOF)':
         CBLOF(contamination=out_frac,
               check_estimator=False, random_state=rs),
     'Isolation Forest': IForest(contamination=out_frac,
                                 random_state=rs),
     'K Nearest Neighbors (KNN)': KNN(
         contamination=out_frac),
     'Average KNN': KNN(method='mean',
                        contamination=out_frac),
     'Local Outlier Factor (LOF)':
         LOF(n_neighbors=35, contamination=out_frac),
     'One-class SVM (OCSVM)': OCSVM(contamination=out_frac),
     'Principal Component Analysis (PCA)': PCA(
         contamination=out_frac, random_state=rs),
 }
  8. Display the names of the classifiers used
 for i, classifier in enumerate(clf.keys()):
     print('Model', i + 1, classifier)

Output:

 Model 1 Angle-based Outlier Detector (ABOD)
 Model 2 Cluster-based Local Outlier Factor (CBLOF)
 Model 3 Isolation Forest
 Model 4 K Nearest Neighbors (KNN)
 Model 5 Average KNN
 Model 6 Local Outlier Factor (LOF)
 Model 7 One-class SVM (OCSVM)
 Model 8 Principal Component Analysis (PCA) 
  9. Fit the models to the data and visualize their results
 for i, offset in enumerate(clusters_separation):
     np.random.seed(42)
     # Data generation: two Gaussian inlier clusters centred at -offset and +offset
     X1 = 0.3 * np.random.randn(num_inliers // 2, 2) - offset  # first inlier cluster
     X2 = 0.3 * np.random.randn(num_inliers // 2, 2) + offset  # second inlier cluster
     # stack X1 and X2 row-wise using numpy.r_
     X = np.r_[X1, X2]
     # append outliers drawn uniformly from [-6, 6] x [-6, 6]
     X = np.r_[X, np.random.uniform(low=-6, high=6, size=(num_outliers, 2))]
     # Fit the models one by one
     plt.figure(figsize=(15, 12))
     # for each classifier to be tested
     for j, (classifier_name, classifier) in enumerate(clf.items()):
         # fit the classifier to the data X
         classifier.fit(X)
         # outlier scores, negated so that larger values mean more normal
         scores_pred = classifier.decision_function(X) * -1
         # make binary predictions with the classifier
         y_pred = classifier.predict(X)
         # score threshold: the out_frac-th percentile of the negated scores
         threshold = percentile(scores_pred, 100 * out_frac)
         # number of errors = mismatches between predictions and ground truth
         num_errors = (y_pred != ground_truth).sum()
         # plot the level lines and the points
         Z = classifier.decision_function(np.c_[x.ravel(), y.ravel()]) * -1
         Z = Z.reshape(x.shape)
         # 2 rows of 4 subplots each
         subplot = plt.subplot(2, 4, j + 1)
         # filled contours for the outlier region (scores below the threshold)
         subplot.contourf(x, y, Z, levels=np.linspace(Z.min(), threshold, 7),
                          cmap=plt.cm.Blues_r)
         # red line: the learned decision boundary
         a = subplot.contour(x, y, Z, levels=[threshold],
                             linewidths=2, colors='red')
         # orange fill: the inlier region (scores above the threshold)
         subplot.contourf(x, y, Z, levels=[threshold, Z.max()],
                          colors='orange')
         # white points: true inliers
         b = subplot.scatter(X[:-num_outliers, 0], X[:-num_outliers, 1],
                             c='white', s=20, edgecolor='k')
         # black points: true outliers
         c = subplot.scatter(X[-num_outliers:, 0], X[-num_outliers:, 1],
                             c='black', s=20, edgecolor='k')
         subplot.axis('tight')
         # legend of the subplot
         subplot.legend(
             [a.collections[0], b, c],
             ['learned decision function', 'true inliers', 'true outliers'],
             prop=matplotlib.font_manager.FontProperties(size=10),
             loc='lower right')
         # X-axis label with the model name and its error count
         subplot.set_xlabel("%d. %s (errors: %d)" % (j + 1, classifier_name,
                                                     num_errors))
         # limits of both axes
         subplot.set_xlim((-7, 7))
         subplot.set_ylim((-7, 7))
     # layout parameters
     plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
     # centred title for the whole figure
     plt.suptitle("Outlier detection by 8 models")
 plt.show()  # display the plots

Output: a 2 × 4 grid of contour plots (one per model) titled "Outlier detection by 8 models", each showing the learned decision boundary in red, true inliers as white points, true outliers as black points, and the model's error count.
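
Since ground_truth is available, each detector's scores can also be evaluated numerically rather than only visually. The snippet below is a small follow-up sketch (not part of the original walkthrough) using PyOD's evaluate_print utility, which prints the ROC-AUC and precision @ rank n for each model:

 from pyod.utils.data import evaluate_print

 for classifier_name, classifier in clf.items():
     # decision_scores_ holds the outlier scores assigned to X during fit()
     evaluate_print(classifier_name, ground_truth, classifier.decision_scores_)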
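
The combo toolbox installed in step 1 enables the model-combination feature highlighted earlier. As a hedged sketch (again, an addition for this guide), the eight detectors' scores can be standardized and merged with PyOD's combination functions, here average and maximization:

 import numpy as np
 from pyod.models.combination import average, maximization
 from pyod.utils.utility import standardizer

 # column j holds detector j's outlier scores for every sample in X
 all_scores = np.column_stack(
     [classifier.decision_scores_ for classifier in clf.values()])
 # standardize each detector's scores to zero mean and unit variance
 all_scores = standardizer(all_scores)

 print('Combined by averaging:', average(all_scores)[:5])
 print('Combined by maximization:', maximization(all_scores)[:5])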

References

For an in-depth understanding of the PyOD toolkit and its tutorials, refer to the following sources:

  • Zhao, Y., Nasrullah, Z. and Li, Z. (2019). PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research (JMLR), 20(96), 1–7.
  • PyOD documentation: https://pyod.readthedocs.io
  • PyOD source code: https://github.com/yzhao062/pyod

Nikita Shiledarbaxi
A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.
