A Hands-On Guide to Outlier Detection with Alibi Detect

The detection of dataset elements that differ significantly from the majority of instances is known as outlier detection. There are various visualization methods and statistical tests, such as the z-test and Grubbs' test, as well as other algorithms, used to detect them. Alibi Detect is a toolbox for detecting anomalies such as outliers, dataset drift, and adversarial attacks in a variety of data types, including tabular data, images, and time series. We will discuss this toolbox in detail in this post. Below is a list of the major points to be discussed.

Table Of Contents

  1. What is Outlier Detection?
  2. Algorithms Used to Detect Outliers
  3. How can Alibi Detect be used?
  4. Detecting the outlier using Alibi Detect

Let’s start the discussion by understanding Outlier Detection.

What is Outlier Detection?

Data points that are unusually far from the rest of the observations in a dataset are known as outliers. They are primarily caused by data errors (measurement or experimental errors, data collection or processing errors, and so on) or by naturally singular behaviour that differs from the norm. In medical applications, for example, very few people have a systolic blood pressure greater than 200; if we keep such values in the dataset, our statistical analysis and modelling conclusions may be skewed.

Outliers can, for instance, distort the mean and standard deviation of a sample. As a result, it is critical to detect and handle them accurately, either by removing them or by capping them at a predefined value. Outlier detection is thus essential for identifying anomalous instances whose model predictions we cannot trust and should not use in production.
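
A quick numerical sketch makes the point: a single extreme value can pull the mean and standard deviation far away from what the bulk of the data suggests.

import numpy as np

values = np.array([1, 2, 3, 4, 5])
with_outlier = np.append(values, 100)  # append one extreme value

print(values.mean(), values.std())              # 3.0 and ~1.41
print(with_outlier.mean(), with_outlier.std())  # ~19.17 and ~36.2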

The type of outlier detector that is appropriate for a given application depends on the modality and dimensionality of the data, the availability of labelled normal and outlier data, and whether the detector is pre-trained (offline) or updated online. An online detector can be deployed as a stateful application, while a pre-trained detector can be deployed as a static machine learning model.

Algorithms Used to Detect Outliers

Mahalanobis Distance

The goal of the Mahalanobis online outlier detector is to flag anomalies in tabular data. The algorithm computes an outlier score, which is a measure of distance from the centre of the feature distribution (the Mahalanobis distance). If this outlier score exceeds a user-specified threshold, the observation is marked as an outlier.

The algorithm is online, which means it begins with no knowledge of feature distribution and learns as requests arrive. As a result, you should expect the output to be poor at first and improve over time. The algorithm works well with low to medium dimensional tabular data.
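
As a minimal sketch of how the online Mahalanobis detector looks in Alibi Detect (the threshold and the random batch below are placeholders for illustration, not tuned values):

import numpy as np
from alibi_detect.od import Mahalanobis

# online detector: no training step; its state updates as predictions arrive
od = Mahalanobis(threshold=10.)  # placeholder outlier-score threshold

X = np.random.randn(100, 5)  # toy tabular batch
preds = od.predict(X, return_instance_score=True)
print(preds['data']['is_outlier'][:10])      # 1 = outlier, 0 = inlier
print(preds['data']['instance_score'][:10])  # Mahalanobis distances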

Isolation Forest

Isolation forests (IF) are tree-based methods for detecting outliers. The IF isolates observations by randomly selecting a feature and then randomly choosing a split value between the feature's maximum and minimum values. The number of splits necessary to isolate a sample equals the length of the path from the root node to the terminating node.

When averaged over a forest of random trees, this path length is a measure of normality that is used to create an anomaly score. Outliers are typically isolated more quickly, resulting in shorter paths. The technique performs effectively on tabular data of low to medium dimensionality.
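
Under the hood, Alibi Detect's IForest detector (used in the walkthrough below) builds on scikit-learn's IsolationForest, so the path-length intuition can be seen directly with scikit-learn on toy data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = rng.randn(200, 2)                 # cluster around the origin
X_outliers = rng.uniform(-6, 6, size=(5, 2))  # points scattered far away

clf = IsolationForest(n_estimators=100, random_state=0).fit(X_inliers)

# decision_function: higher = more normal; predict: -1 = outlier, 1 = inlier
print(clf.decision_function(X_outliers))
print(clf.predict(X_outliers))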

Variational Auto-Encoders

The Variational Auto-Encoder (VAE) outlier detector is first trained on a batch of unlabelled but normal (inlier) data. Because labelled data is often scarce, unsupervised or semi-supervised training is preferable. The VAE detector attempts to reconstruct the data it receives. If the input data cannot be reconstructed well, the reconstruction error is high and the data can be flagged as an outlier.

The mean squared error (MSE) between the input and the reconstructed instance or the probability that both the input and the reconstructed instance are generated by the same process is used to calculate the reconstruction error. This algorithm works well with both tabular and image data.
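
Below is a minimal sketch of Alibi Detect's OutlierVAE on tabular data, assuming the TensorFlow backend; the network sizes, threshold, and random toy data are illustrative assumptions rather than a tuned recipe:

import numpy as np
import tensorflow as tf
from alibi_detect.od import OutlierVAE

n_features, latent_dim = 10, 2

# simple symmetric encoder/decoder networks for tabular data
encoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(n_features,)),
    tf.keras.layers.Dense(16, activation=tf.nn.relu)
])
decoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
    tf.keras.layers.Dense(16, activation=tf.nn.relu),
    tf.keras.layers.Dense(n_features, activation=None)
])

od = OutlierVAE(threshold=0.05,  # placeholder reconstruction-error threshold
                encoder_net=encoder_net,
                decoder_net=decoder_net,
                latent_dim=latent_dim,
                samples=5)       # Monte Carlo samples used to score an instance

X_train = np.random.randn(1000, n_features).astype(np.float32)
od.fit(X_train, epochs=5, verbose=False)
preds = od.predict(X_train[:10], return_instance_score=True)
print(preds['data']['is_outlier'])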

Sequence-to-Sequence 

The Sequence-to-Sequence (Seq2Seq) outlier detector is made up of two main components: an encoder and a decoder. In the encoder, a bidirectional LSTM processes the input sequence and initializes the decoder. The LSTM decoder then predicts the output sequence step by step. The decoder's goal, in this case, is to reconstruct the input sequence.

If the input data cannot be well reconstructed, the reconstruction error is high, and the data is flagged as an outlier. The mean squared error (MSE) between the input and the reconstructed instance is used to calculate the reconstruction error.
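
A rough sketch of Alibi Detect's OutlierSeq2Seq detector follows; the sequence length, latent dimension, and the (batch, sequence length, features) input shape are assumptions chosen to illustrate the API, not a tested recipe:

import numpy as np
from alibi_detect.od import OutlierSeq2Seq

n_features, seq_len = 1, 50

# toy univariate time series cut into fixed-length sequences
X_train = np.random.randn(500, seq_len, n_features).astype(np.float32)

od = OutlierSeq2Seq(n_features,
                    seq_len,
                    threshold=None,  # inferred below from reconstruction errors
                    latent_dim=40)   # size of the LSTM latent representation

od.fit(X_train, epochs=5, verbose=False)

# treat the top 5% of reconstruction errors as outliers
od.infer_threshold(X_train, threshold_perc=95)
preds = od.predict(X_train[:10], return_instance_score=True)
print(preds['data']['is_outlier'])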

The list below summarizes which algorithms under the hood of this toolbox can be used for outlier detection, based on the type of data:

  - Mahalanobis distance: tabular data (online)
  - Isolation Forest: tabular data
  - Variational Auto-Encoder (VAE): tabular and image data
  - Sequence-to-Sequence (Seq2Seq): time series data

How can Alibi Detect be used?

Alibi Detect is a Python library for detecting outliers, adversarial data, and drift. The package aims to include detectors for tabular data, text, images, and time series that can be used both online and offline. For drift detection, both TensorFlow and PyTorch backends are supported.

In fact, Alibi Detect supports a variety of outlier detection techniques, including the Mahalanobis distance, Isolation Forest, and Seq2Seq methods discussed above. The library can also handle a variety of data types, including tabular, image, text, and time series, and different algorithms are required depending on the type of data. Let's now see how the toolbox works in practice.

Detecting the Outlier

To find outliers, we'll use the Isolation Forest algorithm. The dataset we're using is the toolbox's built-in KDD Cup '99 network-intrusion data, which is based on Transmission Control Protocol (TCP) dump data for a simulated local-area network (LAN).

A connection is a sequence of TCP packets that starts and stops at well-defined times and transports data from a source IP address to a destination IP address under a well-defined protocol. Each connection is labelled as either normal or an attack.

Let’s start by installing and importing the dependencies.

! pip install alibi-detect

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
 
from alibi_detect.od import IForest
from alibi_detect.datasets import fetch_kdd
from alibi_detect.utils.data import create_outlier_batch
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.utils.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_roc

Now we will load the dataset, create a batch of normal (inlier) data, and normalize it.

# load data
kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset

# create normal batch
np.random.seed(0)
normal_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=400000, perc_outlier=0)
X_train, y_train = normal_batch.data.astype('float'), normal_batch.target

# apply normalization
mean, stdev = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / stdev

Next, we will define an outlier detector.

# initialize outlier detector
od = IForest(threshold=None,  # threshold for outlier score
             n_estimators=100)
# train
od.fit(X_train)
# save the trained outlier detector
save_detector(od, '/content/')

Because the threshold was left unset, fitting the detector logs a warning that the outlier threshold still needs to be set before the detector can be used.

According to the warning, we still need to establish the outlier threshold. The infer_threshold method can be used to accomplish this. We need to pass it a batch of instances and use threshold_perc to specify what percentage of them we consider normal. Assume we have some data with a known outlier proportion of roughly 5%. In the create_outlier_batch function, perc_outlier can be used to set this proportion of outliers.

# create a batch containing outliers
np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))
 
# add threshold to detector
od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))

The printed output shows the actual percentage of outliers in the batch and the newly inferred threshold.

Now, similar to before, we construct a batch of data containing 10% outliers and use our detector to find the outliers in the batch.

# new batch for prediction
np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
X_outlier, y_outlier = outlier_batch.data.astype('float'), outlier_batch.target
X_outlier = (X_outlier - mean) / stdev
print(X_outlier.shape, y_outlier.shape)
print('{}% outliers'.format(100 * y_outlier.mean()))

# predicting
od_preds = od.predict(X_outlier, return_instance_score=True)

Now, to evaluate the performance of this model, we will use the confusion matrix, F1 score and accuracy score computed between the actual and predicted outliers.

labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
acc = accuracy_score(y_outlier, y_pred)
print('F1 score: {:.4f},\n Accuracy Score: {:.4f}'.format(f1, acc))
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

We can also plot the instance-level outlier scores against the outlier threshold for a better understanding, using the plot_instance_score method.

plot_instance_score(od_preds, y_outlier, labels, od.threshold)

Final Words

With an outlier score of about 0, we can see that the isolation forest does not do a good job of recognizing one type of outlier. This makes it difficult to determine a good threshold without knowing the outlier types. Setting the threshold slightly below 0 would result in much-improved detector performance on the dataset's outliers.
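
As a quick illustration of that last point, the inferred threshold can simply be overridden before re-running prediction; the value below is a hypothetical choice rather than a tuned one.

# manually override the inferred threshold with a value slightly below 0
od.threshold = -0.01
od_preds = od.predict(X_outlier, return_instance_score=True)
y_pred = od_preds['data']['is_outlier']
print('F1 score: {:.4f}'.format(f1_score(y_outlier, y_pred)))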

In this post, we have talked about how critical it is to spot outliers in our data distribution. We encountered Alibi Detect, a toolbox that can detect outliers, drift, and adversarial attacks in a variety of data formats, and we have seen how it can be used to discover outliers in particular.
