Augmenting Cyber Security: How To Use Big Data Analytics For Anomaly Detection

Cyber attacks are increasing at a rapid pace and are now systematically targeted at vulnerable countries. The growing number of online users, and the data they generate, makes the situation worse. In addition, digital services are now closely interconnected with users in critical business sectors such as banking, which is driving cyber crime to higher levels than ever before.

Big data technologies have emerged in response to this explosion of data. However, they raise security concerns of their own, including how the data is stored, which systems hold it and how it is monitored in real time. Nonetheless, big data solutions are still preferred because they can handle very large volumes of data in a short span of time. In this article, we discuss a study conducted by academics at Gazi University, Turkey, which proposes a novel unsupervised anomaly detection approach that uses NetFlow data to address vulnerabilities in a network on a big data platform.


Why Use NetFlow?

NetFlow is a network protocol, first introduced by Cisco, for collecting and monitoring information about the traffic flowing between network devices and the applications that use them. Flow data is widely used to detect anomalies and patterns that indicate malicious traffic reaching devices and compromising security. More importantly, it gives computer security experts a view of how traffic behaves. Network anomaly detection can be carried out with many methods, most of which rely on machine learning techniques.
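To make the idea of a flow record concrete, here is a minimal Python sketch of the kind of summary such a record carries. The class name `FlowRecord` and all of its field names are illustrative assumptions; this is not the NetFlow export format or the schema used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, simplified summary of one flow between two endpoints."""
    start_s: float   # flow start time in seconds (assumed unit)
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str    # e.g. "tcp" or "udp"
    packets: int     # packets seen in this flow
    nbytes: int      # bytes seen in this flow

    @property
    def bytes_per_packet(self) -> float:
        # Guard against a record that reports zero packets.
        return self.nbytes / self.packets if self.packets else 0.0


# Example record with made-up values.
flow = FlowRecord(12.5, "10.0.0.7", "192.0.2.10", 53124, 53, "udp", 4, 320)
```

A record like this describes a conversation rather than its contents, which is why flow data lends itself to behavioral analysis.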

The Approach To Anomaly Detection

In the study, clustering, a branch of machine learning, is used to detect network anomalies. The academics say that the approach follows six steps:

  1. First, the NetFlow records are divided into time intervals, since most activity shows similar behavior over a span of several minutes.
  2. The flows in each interval are then aggregated by source IP.
    • This reduces the amount of data to be processed.
    • The aggregated data may reveal behavioral patterns that individual flows do not.
  3. The aggregated data is standardized using the z-score, z = (x - μ)/σ, where μ is the mean and σ is the standard deviation.
    • This procedure equalises the variability of the data.
    • Standardised data is less affected by outliers.
  4. The aggregated flows are clustered with a distributed implementation of the k-means algorithm.
    • Because unsupervised techniques are trained on unlabeled data, they are able to detect unfamiliar attacks.
    • The clusters are expected to form according to normal or abnormal traffic behavior.
  5. The Euclidean distance of each cluster element to its cluster center is calculated.
    • In a good clustering, the elements lie close to their center.
    • Elements may be abnormally distant from the center for any number of reasons, so the distance to the centroid can be used for outlier detection.
    • A histogram is used to understand how the distances from the center are distributed.
    • Elements that lie far from the concentrated region of the histogram are considered anomalous.
  6. The actual numbers of normal and abnormal flows are determined for the time intervals processed in steps 4 and 5. Finally, the success criterion is evaluated.
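Steps 1 to 5 can be sketched on a single machine in plain Python. This is only an illustration under several assumptions, not the study's implementation: flows are given as hypothetical `(start_s, src_ip, packets, nbytes)` tuples, only three made-up features are aggregated, the k-means here is an ordinary non-distributed one, clustering is run separately for each interval, and a "mean plus three standard deviations" cutoff stands in for reading the histogram by eye.

```python
import math
import random
import statistics
from collections import defaultdict


def aggregate(flows, interval_s=60):
    """Steps 1-2: bucket flows by time interval, then sum per source IP.

    Returns {interval: {src_ip: [flow count, packets, bytes]}}.
    """
    table = defaultdict(lambda: defaultdict(lambda: [0, 0, 0]))
    for start_s, src_ip, packets, nbytes in flows:
        row = table[int(start_s // interval_s)][src_ip]
        row[0] += 1
        row[1] += packets
        row[2] += nbytes
    return table


def standardize(rows):
    """Step 3: z = (x - mu) / sigma, column by column."""
    cols = list(zip(*rows))
    mus = [statistics.fmean(c) for c in cols]
    sigmas = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid dividing by 0
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, iters=20, seed=0):
    """Step 4: plain Lloyd's k-means (single machine, not distributed)."""
    centers = [list(p) for p in random.Random(seed).sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [statistics.fmean(c) for c in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


def detect(flows, k=2, interval_s=60):
    """Steps 1-5 per interval; returns flagged (interval, src_ip) pairs."""
    flagged = []
    for interval, by_ip in sorted(aggregate(flows, interval_s).items()):
        ips = sorted(by_ip)
        if len(ips) < k:
            continue  # too few sources to cluster in this interval
        points = standardize([by_ip[ip] for ip in ips])
        centers, labels = kmeans(points, k)
        # Step 5: distance of each element to its own cluster center.
        dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
        # Assumed cutoff, standing in for inspecting the histogram.
        cutoff = statistics.fmean(dists) + 3 * statistics.pstdev(dists)
        flagged += [(interval, ip) for ip, d in zip(ips, dists) if d > cutoff]
    return flagged
```

Calling `detect(flows, k=2)` returns the source IPs whose aggregated behavior sits unusually far from its cluster center in a given interval. The choice of `k`, the features and the cutoff rule all strongly affect what gets flagged, so they would need tuning against labeled data.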

Cloud Environment And The Datasets For Implementation

For the study, NetFlow data was processed on the Apache Spark big data framework, running on the Azure HDInsight cloud service. Python was used as the main programming language. To detect network attacks, the CTU-13 dataset was used, since it provides sample attack scenarios for studying network behavior. Specifically, the study focused on the tenth scenario in the dataset (UDP DDoS attacks) because it covers botnet attacks and is large in size: it has 1,309,792 NetFlow records, of which 106,352 are UDP DDoS flows.

The implementation follows the approach described earlier. The NetFlow data was split into one-minute intervals so that anomalies could be captured without any single interval being crowded with them. On this basis, the unsupervised anomaly detection was developed. Full details are available in the original study. The detection accuracy was found to be 96 percent. To visualise the results, the six features used in the study were reduced to three dimensions using principal component analysis (PCA).

The figures show anomalies in the data. Blue indicates normal traffic and red indicates abnormal traffic (possible botnet attacks). The first figure depicts regular monitoring of anomalies, while the second depicts the anomalies found using the proposed method. (Image courtesy: Duygu Sinanc & Seref Sagiroglu)
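The article does not say how the 96 percent figure was computed. Assuming the usual definition of accuracy (correct decisions divided by all decisions), the evaluation step could look like the following sketch. The function names and the boolean encoding, where `True` means anomalous, are illustrative assumptions.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN), where True means 'anomalous'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, fp, tn, fn


def accuracy(actual, predicted):
    """Share of items whose predicted label matches the actual label."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0
```

Because attack flows are a small share of the dataset described above, it can be worth looking at the four counts separately as well as at the single accuracy number.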



As technology advances, cyber crimes are committed with greater ease and deception. They are sometimes harder to detect and censure, owing to anonymity and other evasive methods used by cyber criminals. This study should prove useful for future work on countering attacks on computer networks using big data and machine learning.

Abhishek Sharma
I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.
