Outlier detection is the task of identifying instances in a dataset that differ significantly from the majority of observations. Outliers can be found with visualization methods and statistical tests, such as the z-test and Grubbs' test, as well as dedicated algorithms. Alibi Detect is a toolbox for detecting anomalies such as outliers, dataset drift, and adversarial attacks across a variety of data types, including tabular data, images, and time series. We will discuss this toolbox in detail in this post. Below is a list of the major points to be discussed.

**Table Of Contents**

- What is Outlier Detection?
- Algorithms Used to Detect Outliers
- How can Alibi Detect be used?
- Detecting the outlier using Alibi Detect

Let’s start the discussion by understanding Outlier Detection.

**What is Outlier Detection?**

Data points that lie unusually far from the rest of the observations in a dataset are known as outliers. They are caused either by data errors (measurement or experimental errors, data collection or processing errors, and so on) or by genuinely rare behaviour that differs from the norm. In medical applications, for example, very few people have a systolic blood pressure greater than 200; if we keep such points in the dataset, our statistical analysis and modelling conclusions will be skewed.

Outliers can distort summary statistics such as the mean and standard deviation. As a result, it is critical to detect them accurately and handle them, either by removing them or capping them at a predefined value. Outlier detection is also important for identifying instances whose model predictions we cannot trust and should not use in production.
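As a quick illustration of how a single outlier skews summary statistics, consider a toy batch of blood pressure readings (the values here are made up for the example):

```python
import numpy as np

# systolic blood pressure readings; the last value is an outlier
readings = np.array([118, 122, 125, 119, 121, 250], dtype=float)

print(np.mean(readings))   # pulled well above the typical range by one value
print(np.std(readings))

clean = readings[:-1]      # drop the outlier
print(np.mean(clean))      # ~121, close to every individual reading
print(np.std(clean))       # far smaller spread
```

A single extreme value shifts the mean by over 20 points and inflates the standard deviation roughly twentyfold, which is exactly why outliers must be handled before statistical analysis.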

The type of outlier detector that is appropriate for a given application is determined by the data's modality and dimensionality, the availability of labelled normal and outlier data, and whether the detector is pre-trained (offline) or updated online. The online detector can be deployed as a stateful application, while the pre-trained detector can be deployed as a static machine learning model.

**Algorithms Used to Detect Outliers**

*Mahalanobis Distance*

The goal of the Mahalanobis online outlier detection is to predict anomalies in tabular data. The algorithm computes an outlier score, which is a measure of distance from the feature distribution’s centre (Mahalanobis distance). If this outlier score exceeds a user-specified threshold, the observation is marked as an outlier.

The algorithm is online, which means it begins with no knowledge of feature distribution and learns as requests arrive. As a result, you should expect the output to be poor at first and improve over time. The algorithm works well with low to medium dimensional tabular data.
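The score itself is the classic Mahalanobis distance. As a minimal NumPy sketch of the idea (not the Alibi Detect online implementation, and with an illustrative, user-chosen threshold):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # batch of normal (inlier) tabular data
x_out = np.array([8.0, 8.0, 8.0])      # a point far from the feature centre

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    # distance from the feature distribution's centre,
    # scaled by the feature covariance
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

threshold = 4.0                        # illustrative user-specified threshold
print(mahalanobis(X[0]))               # small: close to the centre
print(mahalanobis(x_out) > threshold)  # exceeds the threshold -> outlier
```

The online detector in Alibi Detect updates the mean and covariance incrementally as requests arrive, rather than computing them once over a fixed batch as above.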

*Isolation Forest*

Isolation forests (IF) are tree-based methods for detecting outliers. The IF isolates observations by randomly selecting a feature and then randomly determining a split value between the feature’s maximum and minimum values. The number of splittings necessary to isolate a sample is equal to the length of the path from the root node to the terminating node.

When averaged over a forest of random trees, this path length is a measure of normalcy that is used to create an anomaly score. Outliers are typically isolated more quickly, resulting in shorter routes. The technique performs effectively with tabular data in the low to medium dimension range.
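The same mechanism is available in scikit-learn, which makes for a quick standalone illustration (synthetic data; this is not the Alibi Detect wrapper used later in the post):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))            # normal 2-D tabular data
X_test = np.array([[0.0, 0.1],                  # inlier, near the bulk
                   [6.0, -6.0]])                # obvious outlier

clf = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
print(clf.predict(X_test))        # 1 = inlier, -1 = outlier
print(clf.score_samples(X_test))  # lower score = shorter path = more anomalous
```

The isolated point sits far outside the training distribution, so random splits separate it quickly, giving it a shorter average path and a lower (more anomalous) score.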

*Variational Auto-Encoders*

The outlier detector, the Variational Auto-Encoder (VAE), is first trained on a batch of unlabeled but normal (inlier) data. Because labelled data is often scarce, unsupervised or semi-supervised training is preferable. The VAE detector makes an attempt to reconstruct the data it receives. The reconstruction error is high if the input data cannot be reconstructed well, and the data can be flagged as an outlier.

The mean squared error (MSE) between the input and the reconstructed instance or the probability that both the input and the reconstructed instance are generated by the same process is used to calculate the reconstruction error. This algorithm works well with both tabular and image data.
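The flagging step reduces to comparing per-instance reconstruction error against a threshold. A toy NumPy sketch of that step (the reconstructions here are hard-coded stand-ins for what a trained VAE would produce, and the threshold is illustrative):

```python
import numpy as np

# stand-in data: pretend a trained VAE reconstructs the two inliers
# almost perfectly and the outlier poorly
X       = np.array([[0.10, 0.20], [0.00, 0.10], [5.0, 5.0]])
X_recon = np.array([[0.12, 0.19], [0.02, 0.11], [1.0, 1.2]])

mse = np.mean((X - X_recon) ** 2, axis=1)  # per-instance reconstruction error
threshold = 0.5                            # illustrative threshold
print(mse)
print(mse > threshold)                     # only the last instance is flagged
```

In the real detector the reconstructions come from the VAE's decoder, but the decision rule is exactly this comparison of reconstruction error against the threshold.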

*Sequence-to-Sequence*

The Sequence-to-Sequence (Seq2Seq) outlier detector is made up of two main components: an encoder and a decoder. A Bidirectional LSTM processes the input sequence and initializes the decoder in the encoder. The LSTM decoder then predicts the output sequence sequentially. The decoder’s goal, in this case, is to reconstruct the input sequence.

If the input data cannot be well reconstructed, the reconstruction error is high, and the data is flagged as an outlier. The mean squared error (MSE) between the input and the reconstructed instance is used to calculate the reconstruction error.
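For sequences, the error can be inspected per timestep, which localizes the anomaly within the series. A toy NumPy sketch (the reconstruction here is a hard-coded stand-in for the Seq2Seq decoder's output, and the threshold is illustrative):

```python
import numpy as np

# a sine wave with a spike injected at t = 50
t = np.arange(100)
series = np.sin(0.1 * t)
series[50] += 3.0                    # injected anomaly

# stand-in for the decoder output: the smooth signal learned from normal data
recon = np.sin(0.1 * t)

err = (series - recon) ** 2          # per-timestep squared error
threshold = 1.0                      # illustrative threshold
print(np.where(err > threshold)[0])  # only the anomalous timestep is flagged
```

The error is essentially zero everywhere the sequence matches the learned pattern and spikes only at the anomalous timestep, so thresholding it pinpoints the outlier.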

The Alibi Detect documentation includes a table summarizing which of the toolbox's algorithms can be used for outlier detection for each type of data.

**How can Alibi Detect be used?**

Alibi Detect is a Python library for detecting outliers, adversarial data, and drift. The package aims to include detectors for tabular data, text, images, and time series that can be used both online and offline. For drift detection, both TensorFlow and PyTorch backends are supported.

In fact, Alibi Detect supports a variety of outlier detection techniques, including the Mahalanobis distance, Isolation Forest, and Seq2Seq detectors discussed above. The library can also handle a variety of data types, including tabular, image, text, and time series, with different algorithms suited to different data types. Now let's see the toolbox in action.

**Detecting the Outlier**

To find outliers, we'll use the Isolation Forest algorithm. The dataset we're using here is built-in toolbox data for detecting computer network intrusions using Transmission Control Protocol (TCP) dump data for a simulated local-area network (LAN).

A connection is a set of TCP packets that start and stop at predetermined times and transport data from a source IP address to a destination IP address using a predetermined protocol. Each connection is classified as either safe or dangerous.

Let’s start by installing and importing the dependencies.

```python
! pip install alibi-detect

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

from alibi_detect.od import IForest
from alibi_detect.datasets import fetch_kdd
from alibi_detect.utils.data import create_outlier_batch
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.utils.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_roc
```

Now we will load the dataset and define a normal batch of data and normalize the data.

```python
# load data
kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset

# create normal batch
np.random.seed(0)
normal_batch = create_outlier_batch(kddcup.data, kddcup.target,
                                    n_samples=400000, perc_outlier=0)
X_train, y_train = normal_batch.data.astype('float'), normal_batch.target

# apply normalization
mean, stdev = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / stdev
```

Next, we will define an outlier detector.

```python
# initialize outlier detector
od = IForest(threshold=None,  # threshold for outlier score
             n_estimators=100)

# train
od.fit(X_train)

# save the trained outlier detector
save_detector(od, '/content/')
```

The above definition returns a warning that no outlier threshold has been set yet.

We still need to establish the outlier threshold, according to the warning. The *infer_threshold* method can be used to accomplish this. We’ll need to pass a batch of instances and use *threshold_perc* to define what percentage of them we consider typical. Assume we have some data with a known percentage of outliers of roughly 5%. In the *create_outlier_batch* function, *perc_outlier* can be used to set the proportion of outliers.

```python
# create batch with outliers
np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target,
                                       n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))

# add threshold to detector
od.infer_threshold(X_threshold, threshold_perc=100 - perc_outlier)
print('New threshold: {}'.format(od.threshold))
```

The printed output shows the percentage of outliers in the batch and the newly inferred threshold.

Now, similar to before, we construct a batch of data containing 10% outliers and use our detector to find the outliers in the batch.

```python
# new batch for prediction
np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target,
                                     n_samples=1000, perc_outlier=10)
X_outlier, y_outlier = outlier_batch.data.astype('float'), outlier_batch.target
X_outlier = (X_outlier - mean) / stdev
print(X_outlier.shape, y_outlier.shape)
print('{}% outliers'.format(100 * y_outlier.mean()))

# predict
od_preds = od.predict(X_outlier, return_instance_score=True)
```

Now, to evaluate the performance of the detector, we will compute the confusion matrix, F1 score, and accuracy score for the actual and predicted outliers.

```python
labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']

f1 = f1_score(y_outlier, y_pred)
acc = accuracy_score(y_outlier, y_pred)
print('F1 score: {:.4f},\n Accuracy Score: {:.4f}'.format(f1, acc))

cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()
```

We can also plot the instance-level outlier scores against the outlier threshold using the plot_instance_score method for a better understanding.

```python
plot_instance_score(od_preds, y_outlier, labels, od.threshold)
```

**Final Words**

With an outlier score of around 0, we can see that the Isolation Forest does not do a good job of recognizing one type of outlier. This makes determining a good threshold difficult without knowing the outlier types. Setting the threshold slightly below 0 would considerably improve the detector's performance on this dataset's outliers.

In this post, we discussed how critical it is to spot outliers in our data distribution. We introduced Alibi Detect, a toolbox that can detect outliers, drift, and adversarial attacks in a variety of data formats, and we saw how to use it to discover outliers in particular.