“Anomaly detection has great significance in detecting fake profiles on social networks like Twitter and Facebook, fake reviews on Amazon, and even financial fraud.”
For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Siddharth Bhatia, who does machine learning research at the National University of Singapore (NUS). His work focuses mainly on streaming anomaly detection. At NUS, he is supported by a President’s Graduate Fellowship. He has also been recognised as a young researcher at the ACM Heidelberg Laureate Forum. In this interview, Bhatia talks about his research and shares a few tips for beginners.
AIM: How important is a PhD in ML?
Siddharth: After completing my master’s in Computer Science at BITS Pilani, I started my PhD in Streaming Anomaly Detection, with the aim of detecting anomalies in real time and so minimising the harm they cause. For someone wanting to go into academia or even an industry research lab, a PhD is more or less a necessity, though it is possible to do good-quality research in ML/AI without one. ML research is not for a select few but for the masses. Also, the hype has created a lot of misconceptions.
AIM: What made you curious about anomaly detection?
Siddharth: Anomaly detection is a critical problem for finding suspicious behaviour in many systems. Intrusion detection, fake ratings, and financial transaction fraud are a few of the many examples. That being said, although anomaly detection has been well researched over the years, the majority of the proposed approaches focus on static graphs. Real-world graphs, however, are usually dynamic, and static approaches risk missing the temporal characteristics of the graphs and of the anomalies in them.
Most approaches for dynamic graphs aggregate edges into graph snapshots. However, to recover from malicious activity as soon as possible, we need to detect anomalies in real time or near real time.
Fraudulent or anomalous events typically occur in microclusters, that is, in groups of suspiciously similar edges. Since the number of vertices can grow as we process the stream of edges, we require an algorithm whose memory usage is constant in the graph size. Existing methods that process edge streams in an online manner aim to detect individually surprising edges, not microclusters, and can thus miss large amounts of suspicious activity.
To address this, we proposed MIDAS (Microcluster-Based Detector of Anomalies in Edge Streams). The goal here is to detect microcluster anomalies. Additionally, by using a principled hypothesis testing framework, MIDAS provides theoretical bounds on the false positive probability, unlike previous methods.
AIM: Can you tell us about your research methodology?
Siddharth: Our approach, MIDAS, finds anomalous edges in a dynamic graph in a streaming manner. The idea is to combine a chi-squared goodness-of-fit test with Count-Min Sketch (CMS) streaming data structures, which yields an anomaly score for each edge. Our method also incorporates temporal and spatial relations to achieve better performance.
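The combination described above can be illustrated with a short Python sketch. This is a simplified illustration, not the authors' released implementation. The class names, the `crc32`-based hashing, and the sketch sizes are assumptions made for the example. It keeps two Count-Min Sketches, one for the current time tick and one for the whole stream, and scores each edge with a chi-squared statistic that compares the current count against the edge's past average. The temporal and spatial relations mentioned in the interview are left out.

```python
import zlib


class CountMinSketch:
    """Fixed-size approximate counter: memory does not grow with the stream."""

    def __init__(self, rows=4, cols=1024):
        self.rows, self.cols = rows, cols
        self.table = [[0.0] * cols for _ in range(rows)]

    def _index(self, row, key):
        # Salt the key with the row number to get one hash function per row.
        return zlib.crc32(f"{row}:{key}".encode()) % self.cols

    def add(self, key, amount=1.0):
        for r in range(self.rows):
            self.table[r][self._index(r, key)] += amount

    def estimate(self, key):
        # Collisions only inflate counts, so the minimum is the best estimate.
        return min(self.table[r][self._index(r, key)] for r in range(self.rows))

    def clear(self):
        for row in self.table:
            for i in range(len(row)):
                row[i] = 0.0


class EdgeScorer:
    """Hypothetical MIDAS-style scorer; time ticks are assumed to start at 1."""

    def __init__(self, rows=4, cols=1024):
        self.current = CountMinSketch(rows, cols)  # counts in the current tick
        self.total = CountMinSketch(rows, cols)    # counts over all ticks
        self.tick = 1

    def score(self, src, dst, tick):
        if tick > self.tick:
            self.current.clear()  # a new tick starts with fresh current counts
            self.tick = tick
        key = f"{src}->{dst}"
        self.current.add(key)
        self.total.add(key)
        a = self.current.estimate(key)  # occurrences in this tick
        s = self.total.estimate(key)    # occurrences so far
        t = self.tick
        if t == 1 or s == 0:
            return 0.0  # no history yet to compare against
        # Chi-squared statistic: current tick versus the mean of all ticks.
        return (a - s / t) ** 2 * t * t / (s * (t - 1))


scorer = EdgeScorer()
for tick in range(1, 6):
    steady = scorer.score("A", "B", tick)  # one edge per tick: expected traffic
burst = [scorer.score("X", "Y", 6) for _ in range(50)]  # sudden repeated edge
```

In this toy stream, the edge that arrives once per tick keeps a score of zero, while the score of the repeated edge grows with every repetition inside the same tick, which is the microcluster behaviour the method is meant to flag. Both sketches have a fixed size, so memory stays constant however many vertices appear.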
For instance, recent intrusion detection datasets typically report tens of features, such as source and destination IP, protocol, and average packet size. It is important to design approaches that can handle such multi-aspect data. In the intrusion detection setting, MIDAS treats all variables of the dataset as categorical, whereas we also want to handle arbitrary mixtures of categorical variables (e.g. source IP address) and numerical variables (e.g. average packet size). We therefore propose MStream, a method for processing a stream of multi-aspect data, such as event logs and multi-attributed graphs, that detects group anomalies and incorporates correlations between the features.
For our research, we mainly code in Python and frequently use the scikit-learn library. For frameworks, I prefer PyTorch, and AWS is my go-to cloud option.
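To make the idea of mixed categorical and numerical features concrete, here is a minimal Python sketch of one way a record could be reduced to a bucket index that a fixed-size counting structure can use. It is not MStream's actual hashing scheme. The function names, the log-scaled binning, the `bin_width` parameter, and the sample record are all assumptions made for illustration.

```python
import math
import zlib


def numeric_bin(value, bin_width=0.5):
    """Compress the range with log1p, then cut it into equal-width bins."""
    return math.floor(math.log1p(abs(value)) / bin_width)


def feature_bucket(name, value, num_buckets=1024, bin_width=0.5):
    """Map one feature to a bucket index in [0, num_buckets)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Numerical: nearby values (e.g. similar packet sizes) share a bin.
        token = str(numeric_bin(value, bin_width))
    else:
        # Categorical: the value itself is the token.
        token = str(value)
    return zlib.crc32(f"{name}={token}".encode()) % num_buckets


def record_bucket(record, num_buckets=1024, bin_width=0.5):
    """Combine the per-feature buckets into one bucket for the whole record."""
    parts = [
        feature_bucket(name, value, num_buckets, bin_width)
        for name, value in sorted(record.items())  # fixed feature order
    ]
    return zlib.crc32(repr(parts).encode()) % num_buckets


# Hypothetical records: same categorical values, slightly different sizes.
a = {"src_ip": "10.0.0.5", "protocol": "tcp", "avg_packet_size": 500.0}
b = {"src_ip": "10.0.0.5", "protocol": "tcp", "avg_packet_size": 510.0}
same = record_bucket(a) == record_bucket(b)
```

With these assumed settings the two records land in the same bucket, so a burst of near-identical records would raise a single counter rather than being spread over many. The bin width controls how close two numerical values must be to count as similar.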
AIM: What are the key findings from your research?
Siddharth: MIDAS uses unsupervised machine learning to detect anomalies in a streaming manner in real time. The approach was designed to address recent sophisticated attacks. It can be used to detect intrusions, Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks, financial fraud, and fake ratings. MIDAS provides theoretical guarantees on the false positive probability and is three orders of magnitude faster than existing state-of-the-art solutions.
The findings of our work on MIDAS can be summarised as follows:
- We could detect microcluster anomalies or suspiciously similar edges.
- We give theoretical guarantees on the false positive probability.
- MIDAS uses constant memory, independent of the graph size.
- MIDAS allows for real-time anomaly detection.
- It is scalable and can process up to 4 million edges in less than 1 second on a normal laptop.
- It is up to 48% more accurate and 644 times faster than state-of-the-art approaches.
AIM: How do you plan on taking this research forward?
Siddharth: We have open-sourced the MIDAS source code, and it is available, along with several other implementations, on the GitHub project page. So far, many developers have implemented MIDAS in Python, Ruby, Rust, R, and Golang, in addition to the C++ version we originally released.
MIDAS is currently being deployed in real-world systems to improve their performance. We have been approached by cybersecurity firms too.
We have extended MIDAS to MStream, which detects anomalies in high-dimensional multi-aspect data with both categorical and numeric attributes. In terms of both accuracy and running time, MStream outperformed several baselines, including popular scikit-learn algorithms like Isolation Forest and Local Outlier Factor.
MIDAS can also be used to detect fake profiles on social networks like Twitter and Facebook, fake reviews on Amazon, and financial fraud.
AIM: Can you recommend some resources to know more about this field?
Siddharth: There are some nice surveys which give an extensive overview of the field:
- “Graph-based anomaly detection and description: a survey” by Leman Akoglu, Hanghang Tong, and Danai Koutra.
- “Tensor-based anomaly detection: An interdisciplinary survey” by Hadi Fanaee-T and Joao Gama.
- “Deep learning for anomaly detection: A survey” by Raghavendra Chalapathy and Sanjay Chawla.