
Top 10 Datasets For Cybersecurity Projects


Machine learning techniques have become an attractive tool in cybersecurity, for tasks such as fraud detection and identifying malicious activity. Beyond these, machine learning can be applied to many other cybersecurity use cases, including malicious PDF detection, malware domain detection, intrusion detection and detection of mimicry attacks.

Below, we list the top 10 datasets, in no particular order, that you can use in your next cybersecurity project.

MAWILab

About: MAWILab is a database that helps researchers evaluate their traffic anomaly detection methods. It consists of a set of labels locating traffic anomalies in the MAWI archive. The labels are obtained using an advanced graph-based methodology that compares and combines different, independent anomaly detectors. The dataset is updated daily to include new traffic from emerging applications and anomalies.

Get the data here.

Malware Training Sets

About: Malware Training Sets is a machine learning dataset that aims to provide useful, classified data to researchers who want to dig deeper into malware analysis using machine learning techniques. It is one of the recommended classified datasets for malware analysis.

Get the data here.

Comprehensive, Multi-Source Cyber-Security Events

About: This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network. In total, the data set is approximately 12 gigabytes compressed across the five data elements and presents 1,648,275,307 events in total for 12,425 users, 17,684 computers, and 62,974 processes. The data sources include Windows-based authentication events from both individual computers and centralised Active Directory domain controller servers.

Get the data here.

Malicious URLs

About: This dataset includes examples of malicious URLs from a large webmail provider, whose live, real-time feed supplies 6,000-7,500 examples of spam and phishing URLs per day. The malicious URLs are extracted from email messages that users manually label as spam, run through pre-filters to weed out easily detected false positives, and then verified manually as malicious. The data set consists of about 2.4 million URLs (examples) and 3.2 million features.

Get the data here.

ADFA Intrusion Detection Datasets

About: The ADFA Intrusion Detection datasets are designed for evaluating system-call-based host intrusion detection systems (HIDS). They include contemporary datasets for both Linux and Windows: the ADFA Linux Dataset (ADFA-LD) is intended for evaluating traditional HIDS on Linux, and the ADFA Windows Dataset (ADFA-WD) serves the same purpose for HIDS on Windows.

Get the data here.

Unified Host and Network Data Set

About: The Unified Host and Network Dataset is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over approximately 90 days. The data is provided in CSV format, with fields such as Time, Duration, SrcDevice, DstDevice, Protocol, SrcPort, DstPort, SrcPackets, DstPackets, SrcBytes, etc.
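
As a rough illustration of working with fields like these, here is a minimal Python sketch that reads flow rows into typed records and totals source bytes per device. Several things here are assumptions rather than facts from the dataset documentation: that the file has no header row, that the columns appear in the order listed above, that time, duration, protocol and the counters are integers, and that ports are best kept as strings. The sample rows are invented for the example.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, TextIO


class Flow(NamedTuple):
    """One flow record; names mirror the fields listed in the description."""
    time: int
    duration: int
    src_device: str
    dst_device: str
    protocol: int
    src_port: str
    dst_port: str
    src_packets: int
    dst_packets: int
    src_bytes: int
    extra: tuple  # any columns after SrcBytes, kept unparsed


def parse_flows(handle: TextIO) -> Iterator[Flow]:
    """Yield Flow records, skipping rows that are too short or non-numeric."""
    for row in csv.reader(handle):
        if len(row) < 10:
            continue
        try:
            yield Flow(
                int(row[0]), int(row[1]), row[2], row[3], int(row[4]),
                row[5], row[6], int(row[7]), int(row[8]), int(row[9]),
                tuple(row[10:]),
            )
        except ValueError:
            continue


def bytes_by_source(flows: Iterable[Flow]) -> Counter:
    """Sum SrcBytes per source device."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src_device] += flow.src_bytes
    return totals


# Invented sample rows (assumed layout, not real dataset content).
SAMPLE = (
    "100,5,CompA,CompB,6,Port1,443,10,8,1200,900\n"
    "160,2,CompA,CompC,17,Port2,53,1,1,80,120\n"
    "200,9,CompB,CompA,6,Port3,22,4,4,300,300\n"
    "bad,row\n"
)

flows = list(parse_flows(io.StringIO(SAMPLE)))
totals = bytes_by_source(flows)
print(totals.most_common())
```

For the real files, the same parse_flows function could be pointed at an open file handle instead of the in-memory sample, since it reads one row at a time.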

Get the data here.

User-Computer Authentication Associations in Time

About: User-Computer Authentication Associations in Time is an anonymised dataset that spans nine continuous months and represents 708,304,516 successful authentication events from users to computers, collected from the Los Alamos National Laboratory (LANL) enterprise network. The dataset covers 11,362 users and 22,284 computers, represented as U plus an anonymised, unique number and C plus an anonymised, unique number, respectively.
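
The identifier scheme described above (U plus a number for users, C plus a number for computers) lends itself to a simple validation step. The sketch below assumes, hypothetically, that each event is a comma-separated line of the form time,user,computer; that layout is an assumption for illustration and should be checked against the dataset's own documentation. The sample events are made up.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Identifier patterns taken from the description: "U" or "C" plus a number.
USER_RE = re.compile(r"U\d+")
COMPUTER_RE = re.compile(r"C\d+")


def parse_event(line: str) -> Tuple[int, str, str]:
    """Parse one 'time,user,computer' line (assumed layout) and validate IDs."""
    time_str, user, computer = line.strip().split(",")
    if not USER_RE.fullmatch(user):
        raise ValueError(f"unexpected user id: {user!r}")
    if not COMPUTER_RE.fullmatch(computer):
        raise ValueError(f"unexpected computer id: {computer!r}")
    return int(time_str), user, computer


def computers_per_user(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many distinct computers each user authenticated to."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        _, user, computer = parse_event(line)
        seen[user].add(computer)
    return {user: len(computers) for user, computers in seen.items()}


# Invented sample events, not real dataset content.
SAMPLE_EVENTS = ["1,U1,C1", "1,U1,C2", "2,U2,C3", "3,U1,C1"]

fanout = computers_per_user(SAMPLE_EVENTS)
print(fanout)
```

A per-user count of distinct computers is only one possible starting feature; it is used here because it needs nothing beyond the user and computer identifiers.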

Get the data here.

CTU-13 Dataset

About: CTU-13 is a dataset of botnet traffic that was captured at CTU University in the Czech Republic. The goal of the dataset is to provide a large capture of real botnet traffic mixed with normal traffic and background traffic. It consists of thirteen captures, known as scenarios, of different botnet samples.

Get the data here.

Aposemat IoT-23

About: Aposemat IoT-23 is a labelled dataset of malicious and benign network traffic from Internet of Things (IoT) devices. It consists of twenty-three captures (called scenarios): 20 malware captures executed on IoT devices and three captures of benign IoT device traffic.

Get the data here.

EMBER

About: The Endgame Malware BEnchmark for Research, or EMBER, is a collection of features extracted from PE files that serves as a benchmark dataset for researchers. It is an open dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign).
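
The split sizes quoted above can be sanity-checked with a few lines of Python. This is only an arithmetic sketch built from the counts in the description; the dictionary layout and variable names are made up for the example and are not part of any EMBER tooling.

```python
# Sample counts as stated in the description above.
EMBER_SPLITS = {
    "train": {"malicious": 300_000, "benign": 300_000, "unlabeled": 300_000},
    "test": {"malicious": 100_000, "benign": 100_000},
}

# Total samples per split and overall.
split_totals = {name: sum(counts.values()) for name, counts in EMBER_SPLITS.items()}
total_samples = sum(split_totals.values())

# Share of malicious samples among the labeled training samples only.
train = EMBER_SPLITS["train"]
labeled_train = train["malicious"] + train["benign"]
malicious_share = train["malicious"] / labeled_train

print(split_totals)       # per-split totals
print(total_samples)      # overall total
print(malicious_share)    # class balance of the labeled training data
```

The counts add up to the stated 900K training and 200K test samples, and the labeled portion of the training set is evenly split between malicious and benign files.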

Get the data here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.