A guide to end-to-end Anomaly Detection using PyFBAD

PyFBAD is an end-to-end unsupervised anomaly detection package. It provides source code for every phase of the ML workflow.

Essentially, in anomaly detection, we are looking for observations that deviate from the norm, that either exceed or fall short of what we have discovered or defined as normal. Anomaly detection thus provides benefits from both a business and a technical standpoint. To detect anomalies, one can rely on tools such as scikit-learn. However, when it comes to performing the task end to end, there are only a few options, such as PyFBAD, a Python-based package. With it, we can load data from various distributed servers and run state-of-the-art (SOTA) algorithms for anomaly detection. We will talk about this tool in this article, but first, we will go over the important points listed below.

Table of contents

  1. What is anomaly detection?
  2. Techniques of anomaly detection
  3. Algorithms for anomaly detection
  4. How does PyFBAD deal with anomalies?

Let’s start the discussion by understanding anomaly detection.

What is anomaly detection?

Anomalies are data points in a dataset that stand out from the rest and contradict the data’s expected behaviour; they differ from the dataset’s typical patterns. Anomaly detection is the task of finding such points, typically by means of unsupervised data processing. Anomalies can be classified into several categories, including outliers, which are anomalous patterns that appear in the data in an ad hoc or non-systematic manner, and drifts, which are slow, asymmetric long-term changes in the data.

Anomaly detection is useful for detecting fraudulent transactions, detecting diseases, and handling case studies with high class imbalance. Anomaly detection techniques can also be used to build more robust data science models.

Outlier analysis (also known as anomaly detection) is a data mining step that detects data points, events, and/or observations that depart from a dataset’s regular behaviour. Anomalous data can reveal critical events, such as a technical glitch, or potential opportunities, such as a change in consumer behaviour. Machine learning is increasingly being used to detect anomalies.

Techniques of anomaly detection

Anomaly detection techniques fall into three types: unsupervised, semi-supervised, and supervised. The best method is essentially determined by the labels available in the dataset. Supervised anomaly detection techniques require a data set in which every instance is labelled “normal” or “abnormal”, so that a classification algorithm can be trained on it.

Outlier detection is similar to traditional pattern recognition, except that it involves a naturally strong imbalance between the classes. Because of this inherent imbalance, not all statistical classification algorithms are well suited to anomaly detection.

Semi-supervised anomaly detection techniques build a model representing normal behaviour using a normal, labelled training data set. They then use that model to spot anomalies by determining how likely it is for the model to generate any given instance.
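The idea of scoring new instances by how likely the "normal" model is to generate them can be sketched as follows. This is a minimal illustration, assuming a single numeric feature and a Gaussian model of normal behaviour; the training values and the three-standard-deviation threshold are hypothetical choices, not part of any particular library.

```python
import math
import statistics

def fit_normal_model(normal_values):
    """Estimate mean and standard deviation from data labelled as normal."""
    return statistics.fmean(normal_values), statistics.stdev(normal_values)

def log_likelihood(value, mean, std):
    """Log-density of `value` under a Gaussian with the fitted parameters."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

# Hypothetical "normal" training readings (illustrative values only).
normal_training = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.4]
mean, std = fit_normal_model(normal_training)

# Assumed threshold: the log-likelihood of a point 3 standard deviations out.
threshold = log_likelihood(mean + 3 * std, mean, std)

def is_anomaly(value):
    """Flag values that the normal model is unlikely to have generated."""
    return log_likelihood(value, mean, std) < threshold

print(is_anomaly(10.05), is_anomaly(15.0))
```

In practice the model of normal behaviour is usually richer than a single Gaussian, but the scoring step works the same way.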

Unsupervised anomaly detection methods detect anomalies in an unlabelled data set solely based on the data’s intrinsic properties. The working assumption is that the vast majority of the instances in the data set are normal. The anomaly detection algorithm then looks for instances that don’t seem to fit in with the rest of the data set.
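As a small illustration of the unsupervised setting, the sketch below flags values that sit far from the bulk of an unlabelled sample. It uses a robust z-score based on the median and the median absolute deviation (MAD), which is one simple option chosen here for illustration; the cutoff of 3.5 and the sample readings are assumptions.

```python
import statistics

def mad_outliers(values, cutoff=3.5):
    """Return indices of values whose robust z-score exceeds `cutoff`."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > cutoff]

# Hypothetical unlabelled readings; no labels are used anywhere.
readings = [12, 13, 12, 14, 13, 12, 95, 13, 14, 12]
print(mad_outliers(readings))
```

Because the median and MAD are barely affected by a few extreme values, the assumption that most instances are normal is what makes this work.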

Algorithms for anomaly detection

Isolation Forest

The Isolation Forest algorithm detects anomalies using a tree-based approach. Rather than modelling normal data, it isolates anomalies directly, relying on the fact that they are both few in number and distinct in the feature space. To do this, it generates a random forest in which decision trees are grown at random: at each node, a feature is chosen at random, and a random threshold value is chosen to split the dataset in two.

It keeps chopping away at the dataset until all instances are isolated from one another. Because an anomaly is usually far away from other instances, it becomes isolated in fewer steps than normal instances on average (across all Decision Trees).
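The isolation idea can be sketched in plain Python. This is a simplified illustration, not the full algorithm: instead of building whole trees, it follows only the branch that contains the point of interest and counts the random splits needed to isolate it. The synthetic data, the depth cap, and the number of repetitions are assumptions.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth                         # cannot split any further
    threshold = rng.uniform(lo, hi)          # pick a random threshold
    # Keep only the side of the split that contains the point.
    if point[feature] < threshold:
        side = [row for row in data if row[feature] < threshold]
    else:
        side = [row for row in data if row[feature] >= threshold]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many independent random branches."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Synthetic data: a Gaussian cluster plus one distant point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

print(average_depth(outlier, data), average_depth(cluster[0], data))
```

The distant point should need noticeably fewer splits on average than a point inside the cluster, which is the signal the algorithm turns into an anomaly score.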

Density-based algorithms

Common density-based techniques include K-Nearest Neighbor (KNN), Local Outlier Factor (LOF), and others. Regression and classification systems can both benefit from these techniques.

Each of these algorithms derives the expected behaviour from the regions of highest data point density. Any points that fall outside of these dense zones by a statistically significant amount are flagged as anomalies. Because most of these techniques rely on the distance between points, it’s critical to scale the dataset and normalize the units to ensure accurate results.
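The effect of scaling on a distance-based score can be seen in a short sketch. Here the anomaly score is the mean distance to the k nearest neighbours, a simple stand-in for the density-based methods named above; the (age, income) samples and the choice of k are hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def knn_scores(rows, k=3):
    """Anomaly score per row: mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(rows):
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical (age, income) samples; the last row has an unusual age.
samples = [(25, 40_000), (27, 42_000), (26, 41_000), (24, 39_500),
           (28, 43_000), (26, 40_500), (60, 41_000)]

raw_scores = knn_scores(samples)
scaled_scores = knn_scores(standardize(samples))
print(raw_scores.index(max(raw_scores)),
      scaled_scores.index(max(scaled_scores)))
```

Without scaling, the income column dominates every distance and the unusual age goes unnoticed; after standardizing, the last row receives the highest score.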

SVM-based approach

The support vector machine (SVM) is a supervised learning technique that yields a robust prediction model and is primarily used for classification. In its standard form, it takes a series of training examples, each of which is labelled as belonging to one of two groups. For anomaly detection, the one-class SVM variant is commonly used.

The system then produces criteria for categorizing additional cases. To do so, the algorithm maps examples to points in space in a way that maximizes the margin between the two categories.

The system identifies a value as an outlier if it is too far outside of either category’s range. If you don’t have labelled data, you can use an unsupervised learning strategy to establish categories by looking for grouping among cases.
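For the unlabelled case, a one-class SVM can be sketched with scikit-learn's `OneClassSVM`. This is a minimal illustration on synthetic data, assuming scikit-learn and NumPy are installed; the `nu` value and the test points are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic "normal" observations clustered around the origin (illustrative).
normal_data = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is roughly the fraction of training points allowed outside the boundary;
# 0.05 is an assumption for this sketch, not a recommendation.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_data)

# One point near the centre of the data and one far away from it.
new_points = np.array([[0.1, -0.2], [8.0, 8.0]])
labels = model.predict(new_points)  # +1 = inlier, -1 = outlier
print(labels)
```

The model learns a boundary around the bulk of the training data, and points falling outside it are reported as outliers.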

How does PyFBAD deal with anomalies?

PyFBAD is an unsupervised anomaly detection package that covers the workflow from start to finish, with source code for every phase of the ML flow. With its various modules, data can be read from a file such as a CSV, or from databases such as MongoDB or MySQL. Preprocessing procedures can then be used to prepare the data for the model.

Different machine learning models, such as Prophet or Isolation Forest, can be used for training. Results of anomaly detection can be sent by email or Slack. In other words, the entire project cycle can be completed using only the source code provided by PyFBAD, without reaching for other libraries directly.

Let’s get started with the package. First, we install it with pip and import all the dependencies. In this implementation, Plotly is used for interactive plotting.

# Assumed install command (package name taken from the imports): pip install pyfbad
import plotly.express as px
import plotly.graph_objects as go
from pyfbad.data import database as db
from pyfbad.models import models as md
from pyfbad.features import create_feature as cf

Because this tool is an end-to-end platform, we can pull data from databases as well as files; this is done through the database module. Here we load a standard CSV file that holds stock information for Microsoft:

# initialize the connection
connection = db.File()
data = connection.read_from_csv('/content/Microsoft_Stock.csv')
data.head()

For time-series anomaly detection, we need to create a feature set that contains a date-time column and the values in which we want to detect anomalies. In our case, that is the volume of shares.

features = cf.Features()
features_set = features.get_model_data(df=data, time_column_name='Date', value_column_name='Volume')
features_set

Next, using the feature set generated above, we can detect anomalies with one of the model objects that PyFBAD provides. At the time of writing, it offers Prophet and Isolation Forest as algorithms.

# initialize the algorithm
models = md.Model_Prophet()
# train algorithm on the features
trained_features = models.train_model(features_set)
# get the anomalies
forecast_anomaly = models.train_forecast(trained_features)

Now that we have detected a set of anomalies in our dataset, we can visualize them using Plotly. The first graph shows the main series, and the second shows the anomaly points that the model has detected.

Final words

In this article, we discussed what anomalies are and why it is important to detect and treat them appropriately in order to arrive at sound business solutions. We briefly covered the basic techniques and algorithms used to deal with them. Lastly, we used the Python-based toolbox PyFBAD to detect the anomalies present in a dataset.

Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
