Augmenting Cyber Security: How To Use Big Data Analytics For Anomaly Detection

Cyber attacks are increasing at a rapid pace and are now systematically targeted at vulnerable countries. The growing number of online users, and the data they generate, is worsening the situation. In addition, critical business sectors such as banking are now closely tied to digital usage, which is giving rise to more cyber crime than ever before.

Big data technologies have emerged extensively as a result of this data burst. However, this has raised security concerns around methods of data storage, the systems involved and real-time monitoring, amongst other challenges. Nonetheless, big data solutions are still preferred because they can handle voluminous amounts of data in a short span of time. In this article, we discuss a study conducted by academics at Gazi University, Turkey, in which a novel unsupervised anomaly detection approach for networks is designed to address vulnerabilities in a big data network using NetFlow data.

Why Use NetFlow?

NetFlow is a network protocol first introduced by Cisco to monitor and collect network traffic data among devices as well as computer software and applications. The collected flow records can be used to detect anomalies and patterns in which dangerous information is circulated to devices and compromises security. More importantly, NetFlow gives computer security experts a view into the behavior of the traffic flow. Network anomaly detection itself can be done using many methods, most of which follow machine learning techniques.

The Approach To Anomaly Detection

In the study, clustering techniques from machine learning are used to detect network anomalies. The academics say that the approach consists of the following six steps:

  1. First, NetFlows are divided into time intervals, since most actions show similar behavior over a span of several minutes.
  2. NetFlows are then aggregated by source IP.
    • The data size is reduced for processing
    • The aggregated data may show new patterns to detect behavior.
  3. The obtained data is standardized using the z-score, z = (x - μ)/σ, where μ is the mean and σ is the standard deviation.
    • This procedure equalises the data variability.
    • Standardised data is less affected by outliers.
  4. The aggregated NetFlows are clustered using the k-means algorithm, run in a distributed manner.
    • Unsupervised techniques, which are trained with unlabeled data, have the ability to detect unfamiliar attacks.
    • Clusters are expected to form according to normal or abnormal traffic behavior.
  5. The Euclidean distance of the cluster elements to the cluster center is calculated.
    • For a good clustering, the elements in a cluster should be close to its center.
    • Elements may be abnormally distant from the center for a variety of reasons, so the distance to the centroid can be used for outlier detection.
    • A histogram is used to understand the distribution of the distances of the elements from the center.
    • Elements that lie far from the concentrated region of the histogram are considered anomalous.
  6. The actual numbers of normal and abnormal flows are determined for the time intervals from steps 4 and 5. Finally, the success criterion is evaluated.
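Steps 1 to 5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the study's code: the study ran on Spark, whereas this sketch uses the standard library alone. The flow field names (start, src_ip, bytes, packets), the three aggregated features, the seeding of k-means with the first k rows, and the "mean plus two standard deviations" cut-off (a stand-in for reading the histogram by eye) are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict

def aggregate_by_source(flows, interval_seconds=60):
    """Steps 1-2: bucket flows into time intervals, then aggregate per source IP."""
    buckets = defaultdict(lambda: [0.0, 0.0, 0.0])  # flow count, bytes, packets
    for flow in flows:
        window = int(flow["start"] // interval_seconds)
        totals = buckets[(window, flow["src_ip"])]
        totals[0] += 1
        totals[1] += flow["bytes"]
        totals[2] += flow["packets"]
    keys = sorted(buckets)
    return keys, [buckets[key] for key in keys]

def zscore(rows):
    """Step 3: standardize every column with z = (x - mu) / sigma."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has sigma = 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def kmeans(rows, k, iterations=50):
    """Step 4: plain Lloyd's k-means, seeded with the first k rows (an assumption)."""
    centers = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(row, centers[c]))
                  for row in rows]
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def detect(flows, k=2):
    """Step 5: flag (interval, source IP) pairs that lie far from their cluster center."""
    keys, features = aggregate_by_source(flows)
    rows = zscore(features)
    labels, centers = kmeans(rows, k)
    dists = [math.dist(row, centers[label]) for row, label in zip(rows, labels)]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    # Hypothetical cut-off standing in for visual inspection of the histogram.
    return [key for key, d in zip(keys, dists) if d > mean + 2 * std]

def demo_flows():
    """Synthetic traffic: four light hosts, four heavy hosts and one flooding host."""
    flows = []
    for host in range(1, 9):
        heavy = host % 2 == 0
        for i in range(10 if heavy else 2):
            flows.append({"start": float(i), "src_ip": f"10.0.0.{host}",
                          "bytes": 1500 if heavy else 500,
                          "packets": 10 if heavy else 5})
    for i in range(200):
        flows.append({"start": i * 0.25, "src_ip": "10.0.0.9",
                      "bytes": 100, "packets": 1})
    return flows

print(detect(demo_flows()))  # [(0, '10.0.0.9')]
```

In this toy data the flooding host is absorbed into the cluster of heavy hosts but sits far from that cluster's center, which is exactly the situation the distance check in step 5 is meant to catch. On real traffic the number of clusters and the cut-off would have to be tuned.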

Cloud Environment And The Datasets For Implementation

For the study, the NetFlow data was processed on the Apache Spark big data framework, running on the Azure HDInsight cloud service. Python was used as the main programming language. To detect network attacks, the CTU-13 dataset was investigated, since it provides sample attack scenarios for ascertaining network behavior. Specifically, the 10th scenario in the dataset (UDP DDoS attacks) was the focus of the study, because it covers botnet attacks and is large in size (it has 1,309,792 NetFlows, of which 106,352 are UDP DDoS flows).

The implementation follows the approach mentioned earlier. The NetFlow data was split into one-minute time intervals so that anomalies could be captured without the data in any interval being crowded with them. With this, the unsupervised anomaly detection was developed; detailed information can be found in the original study. The detection accuracy was found to be 96 percent. To visualise the results, the six features in the dataset were reduced to three dimensions using principal component analysis (PCA).

The figures show anomalies in the data. Blue indicates normal traffic and red indicates abnormal traffic (possible botnet attacks). The first figure depicts regular monitoring of anomalies, while the second depicts anomalies found using the said method. (Image courtesy: Duygu Sinanc & Seref Sagiroglu)
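The reduction from six features to three dimensions used for such plots can be sketched as follows. This is a minimal illustration and not the study's implementation: it finds the principal components by power iteration with deflation in plain Python, whereas in practice a library routine such as scikit-learn's PCA or Spark's pyspark.ml.feature.PCA would normally be used. The function name pca_project, the fixed start vector and the synthetic six-feature rows are assumptions made for the example.

```python
import math

def pca_project(rows, n_components=3, iterations=500):
    """Project rows onto their top principal components (power iteration sketch)."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        # Fixed start vector keeps the result repeatable; this assumes it is
        # not orthogonal to the component being searched for.
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c * x for c, x in zip(comp, row)) for comp in components]
            for row in centered]

# Synthetic stand-in for six aggregated NetFlow features per source.
rows = [[i, 2 * i + i % 3, i % 5, (i % 2) * 0.5, -i + i % 4, (i * 7 % 11) * 0.1]
        for i in range(20)]
points_3d = pca_project(rows)
print(len(points_3d), len(points_3d[0]))  # 20 3
```

Each row of points_3d can then be drawn in a 3D scatter plot and coloured by whether the corresponding source was labeled normal or abnormal.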

 

Conclusion

As technology advances, cyber crimes are committed with more ease and deception. They are sometimes harder to detect, owing to anonymity and other tricky methods employed by cyber-criminals. This study should prove beneficial for future efforts to counter attacks on computer networks using big data and machine learning.

Abhishek Sharma
I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.
