PyOD is a flexible and scalable toolkit for detecting outliers, or anomalies, in multivariate data; the name stands for Python Outlier Detection. It was introduced by Yue Zhao, Zain Nasrullah and Zheng Li in a Journal of Machine Learning Research (JMLR) paper published in May 2019.
Before going into the details of PyOD, let us understand in brief what outlier detection means.
What is outlier detection?
In data analysis, outliers are data points that differ significantly from the majority of observations or do not conform to the trend or pattern the rest of the data follows. The process of identifying such suspicious data points is known as outlier detection. Detecting fraudulent transactions in the banking sector is one example.
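To make the definition concrete, here is a minimal sketch of flagging points that sit far from the bulk of a one-dimensional sample. It uses the modified z-score, a common rule of thumb based on the median; it is not part of PyOD, and the function name, the cutoff of 3.5 and the transaction amounts are assumptions made for illustration.

```python
import statistics

def flag_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds `cutoff`."""
    med = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [42.0, 39.5, 41.2, 40.8, 38.9, 43.1, 40.1, 950.0]
suspicious = flag_outliers(amounts)
print(suspicious)
```

A single-variable rule like this breaks down quickly on multivariate data, which is where dedicated detectors come in.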
Overview of PyOD
PyOD is an open-source Python toolbox that provides over 20 outlier detection algorithms to date, ranging from traditional techniques like local outlier factor to novel neural network architectures such as adversarial models or autoencoders. The complete list of supported algorithms is available in the PyOD documentation.
Key features of the PyOD toolkit
- It is compatible with both Python 2 and Python 3 on Linux, macOS and Windows. The cross-version compatibility is achieved using the six library.
- PyOD can combine the results of several outlier detectors and ensembles into a single output.
- It includes an easy-to-use API and interactive examples for the supported algorithms.
- Optimization techniques such as parallelization and Just-In-Time (JIT) compilation can be employed for selected models whenever required.
- The project follows practices such as unit testing, code coverage, continuous integration and code maintainability checks.
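The idea of combining several detectors can be sketched in a few lines of NumPy. This is only an illustration of one simple scheme (z-score each detector's scores, then average them), not PyOD's own implementation; the function name and the raw scores are hypothetical.

```python
import numpy as np

def average_standardized(score_matrix):
    """Average per-detector scores after z-scoring each column.

    score_matrix has shape (n_samples, n_detectors); higher = more anomalous.
    """
    scores = np.asarray(score_matrix, dtype=float)
    mean = scores.mean(axis=0)
    std = scores.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for a constant column
    return ((scores - mean) / std).mean(axis=1)

# Hypothetical raw scores from two detectors on four samples.
# The detectors use very different scales, so averaging the raw
# values would let the second column dominate.
raw = np.array([[0.1, 10.0],
                [0.2, 12.0],
                [0.1, 11.0],
                [0.9, 90.0]])
combined = average_standardized(raw)
most_anomalous = int(np.argmax(combined))
print(most_anomalous)
```

Standardizing first matters because different detectors report scores in different units (distances, densities, reconstruction errors), so their raw values are not directly comparable.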
Essential dependencies
- Python 2.7 or >=3.5
- numpy >=1.13
- pandas >=0.25
- numba >=0.35
- scipy >=0.19.1
- scikit-learn >=0.19.1
- joblib
- combo >=0.0.8
- statsmodels
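Before installing, it can help to see which of these packages are already present. The following is a small sketch using the standard library's importlib.metadata, which needs Python 3.8 or later; the helper name is hypothetical, and the package names are simply those in the list above.

```python
from importlib import metadata

# Package names taken from the dependency list above.
REQUIRED = ["numpy", "pandas", "numba", "scipy",
            "scikit-learn", "joblib", "combo", "statsmodels"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(REQUIRED)
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

The sketch only reports versions; comparing them against the minimum versions listed above is left to the reader.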
Practical implementation
Here’s a demonstration of applying eight different outlier detection algorithms using the PyOD library and comparing their results visually. The code shown here was tested on Google Colab with Python 3.7.10 and PyOD 0.8.7.
The following models are included in the demonstration:
- Angle-Based Outlier Detector (ABOD)
- Cluster-based Local Outlier Factor (CBLOF)
- Isolation Forest
- k-Nearest Neighbors (KNN)
- Average KNN
- Local Outlier Factor (LOF)
- One-Class SVM (OCSVM)
- Principal Component Analysis (PCA)
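Two entries in this list, KNN and Average KNN, differ only in how the distances to a point's nearest neighbours are turned into a score. The following NumPy sketch illustrates that difference under simple assumptions (brute-force Euclidean distances, a hypothetical knn_scores helper); it is not PyOD's code, and the method names 'largest' and 'mean' are chosen to echo the method argument of PyOD's KNN.

```python
import numpy as np

def knn_scores(X, k=5, method="largest"):
    """Score each row of X by its distance to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per row
    if method == "largest":
        return nearest[:, -1]                # distance to the k-th neighbour
    if method == "mean":
        return nearest.mean(axis=1)          # average over the k neighbours
    raise ValueError("method must be 'largest' or 'mean'")

# Hypothetical 2-D points: four close together and one isolated.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [8.0, 8.0]])
largest_scores = knn_scores(points, k=2, method="largest")
mean_scores = knn_scores(points, k=2, method="mean")
print(largest_scores.argmax(), mean_scores.argmax())
```

Both variants give the isolated point the highest score here; the averaged version is simply less sensitive to a single unusually distant neighbour.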
Step-wise explanation of the code is as follows:
- Install PyOD and combo toolbox
!pip install --upgrade pyod
!pip install combo
- Import required libraries
from __future__ import division
from __future__ import print_function
import os
import sys
from time import time
import numpy as np
from numpy import percentile
import matplotlib.pyplot as plt
import matplotlib.font_manager
- Import all the models to be used from pyod.models module
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA
- Define the total number of samples and the fraction of outliers
num_samples = 500
out_frac = 0.30
- Initialize inliers and outliers data
clusters_separation = [0]
# Grid over [-7, 7] x [-7, 7], used later to draw the decision contours
x, y = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
# (1 - fraction of outliers) gives the fraction of inliers; multiplying it
# by the total number of samples gives the number of inliers
num_inliers = int((1. - out_frac) * num_samples)
# Multiply the fraction of outliers by the total number of samples
# to compute the number of outliers
num_outliers = int(out_frac * num_samples)
# Ground truth array: 0 marks an inlier and 1 marks an outlier
# (the outliers occupy the last num_outliers positions)
ground_truth = np.zeros(num_samples, dtype=int)
ground_truth[-num_outliers:] = 1
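As a quick check of the arithmetic in this step, the following standalone sketch recomputes the two counts for the values used in this article and verifies that they add up to the total. The assertion is an addition for illustration and is not part of the original code.

```python
import numpy as np

num_samples = 500
out_frac = 0.30

num_inliers = int((1.0 - out_frac) * num_samples)
num_outliers = int(out_frac * num_samples)

# Labels follow the layout of the data: inliers first, outliers last.
ground_truth = np.zeros(num_samples, dtype=int)
ground_truth[-num_outliers:] = 1

# The two groups should account for every sample.
assert num_inliers + num_outliers == num_samples
print(num_inliers, num_outliers, int(ground_truth.sum()))  # prints: 350 150 150
```

Because int() truncates, other settings can leave the two counts one short of the total (for example, 501 samples with the same fraction give 350 and 150), so a check like this is worth keeping when changing the values.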
- Display the number of inliers and outliers and the ground truth array.
print('No. of inliers: %i' % num_inliers)
print('No. of outliers: %i' % num_outliers)
print('Ground truth array shape is {shape}. Outliers are 1 and inliers are 0.\n'.format(shape=ground_truth.shape))
print(ground_truth)
Output:
- Define a dictionary of outlier detection methods to be compared
rs = np.random.RandomState(42)  # random state
# dictionary of classifiers
clf = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=out_frac),
    'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(
        contamination=out_frac, check_estimator=False, random_state=rs),
    'Isolation Forest': IForest(contamination=out_frac, random_state=rs),
    'K Nearest Neighbors (KNN)': KNN(contamination=out_frac),
    'Average KNN': KNN(method='mean', contamination=out_frac),
    'Local Outlier Factor (LOF)': LOF(n_neighbors=35, contamination=out_frac),
    'One-class SVM (OCSVM)': OCSVM(contamination=out_frac),
    'Principal Component Analysis (PCA)': PCA(
        contamination=out_frac, random_state=rs),
}
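Every detector above receives the same contamination value, which tells it what fraction of the data to treat as outliers. The following is a minimal sketch of how such a fraction can be turned into a score threshold, assuming the detector cuts its training scores at the (1 - contamination) percentile; the helper name and the scores are hypothetical, and this is an illustration rather than PyOD's own code.

```python
import numpy as np

def labels_from_scores(scores, contamination):
    """Flag the top `contamination` fraction of scores as outliers (1)."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, 100 * (1 - contamination))
    return (scores > threshold).astype(int), threshold

# Hypothetical anomaly scores for ten samples (higher = more anomalous).
scores = np.array([0.2, 0.1, 0.4, 0.3, 0.2, 5.0, 0.1, 0.3, 4.0, 0.2])
labels, threshold = labels_from_scores(scores, contamination=0.2)

# The same cut expressed on negated scores, where lower = more anomalous:
# the `contamination` percentile of -scores is the mirror image of the
# (1 - contamination) percentile of scores.
flipped_threshold = np.percentile(-scores, 100 * 0.2)
print(labels, threshold, flipped_threshold)
```

The mirrored form is the one that appears in the plotting code of this demonstration, which multiplies the decision scores by -1 before taking the 100 * out_frac percentile.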
- Display the names of classifiers used
for i, classifier in enumerate(clf.keys()):
    print('Model', i + 1, classifier)
Output:
Model 1 Angle-based Outlier Detector (ABOD)
Model 2 Cluster-based Local Outlier Factor (CBLOF)
Model 3 Isolation Forest
Model 4 K Nearest Neighbors (KNN)
Model 5 Average KNN
Model 6 Local Outlier Factor (LOF)
Model 7 One-class SVM (OCSVM)
Model 8 Principal Component Analysis (PCA)
- Fit the models to the data and visualize their results
for offset in clusters_separation:
    np.random.seed(42)
    # Data generation: two groups of inliers, shifted by -offset and +offset
    X1 = 0.3 * np.random.randn(num_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(num_inliers // 2, 2) + offset
    # Build an array having X1 and X2 using numpy.r_
    X = np.r_[X1, X2]
    # Add outliers to the X array; numpy.random.uniform() draws them
    # uniformly from the square [-6, 6) x [-6, 6)
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(num_outliers, 2))]

    # Fit the models one-by-one
    plt.figure(figsize=(15, 12))
    # For each classifier to be tested
    for i, (classifier_name, classifier) in enumerate(clf.items()):
        # fit the classifier to data X
        classifier.fit(X)
        # compute outlier scores, negated so that lower means more anomalous
        scores_pred = classifier.decision_function(X) * -1
        # make predictions using the classifier (0 = inlier, 1 = outlier)
        y_pred = classifier.predict(X)
        # threshold at the out_frac percentile of the negated scores
        threshold = percentile(scores_pred, 100 * out_frac)
        # compute number of errors from the difference between
        # predicted and ground truth values
        num_errors = (y_pred != ground_truth).sum()

        # evaluate the negated scores on the grid to plot the level lines
        Z = classifier.decision_function(np.c_[x.ravel(), y.ravel()]) * -1
        Z = Z.reshape(x.shape)
        # 2 rows having 4 subplots each
        subplot = plt.subplot(2, 4, i + 1)
        # plot filled and unfilled contours using contourf() and contour()
        # respectively; blue shades cover the region scored as outlying
        subplot.contourf(x, y, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=plt.cm.Blues_r)
        # red line for the learned decision function (the threshold contour)
        a = subplot.contour(x, y, Z, levels=[threshold],
                            linewidths=2, colors='red')
        # orange fill for the region scored as inlying
        subplot.contourf(x, y, Z, levels=[threshold, Z.max()],
                         colors='orange')
        # for true inliers
        b = subplot.scatter(X[:-num_outliers, 0], X[:-num_outliers, 1],
                            c='white', s=20, edgecolor='k')
        # for true outliers
        c = subplot.scatter(X[-num_outliers:, 0], X[-num_outliers:, 1],
                            c='black', s=20, edgecolor='k')
        subplot.axis('tight')
        # legend of the subplots
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=10),
            loc='lower right')
        # X-axis label
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, classifier_name, num_errors))
        # marking limits of both the axes
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    # layout parameters
    plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
    # centered title to be given to the figure
    plt.suptitle("Outlier detection by 8 models")
plt.show()  # display the plots
Output:
- Code source: GitHub
- Google Colab notebook of the above implementation
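One step in the plotting loop that can be hard to follow is how the decision function is evaluated over a grid and reshaped for contour plotting. The following standalone sketch isolates that step. It uses a stand-in scoring function (distance from the origin) in place of a fitted detector, and a coarse 5 x 5 grid; both are assumptions made to keep the example small.

```python
import numpy as np

def score_fn(points):
    """Stand-in for decision_function: distance of each point from the origin."""
    return np.sqrt((points ** 2).sum(axis=1))

# A coarse 5 x 5 grid over the same [-7, 7] range used in the plots.
gx, gy = np.meshgrid(np.linspace(-7, 7, 5), np.linspace(-7, 7, 5))

# Flatten both coordinate arrays and pair them up: one (x, y) row per grid node.
grid_points = np.c_[gx.ravel(), gy.ravel()]  # shape (25, 2)

# Score every node, then fold the flat result back into the grid's shape
# so it lines up with gx and gy for contour plotting.
Z = score_fn(grid_points).reshape(gx.shape)  # shape (5, 5)
print(grid_points.shape, Z.shape)
```

The detectors expect a two-column array of samples, while contour() and contourf() expect a 2-D array matching the grid, which is why the flatten-then-reshape round trip is needed.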
References
For in-depth understanding of the PyOD toolkit and its tutorials, refer to the following sources: