Machine learning techniques have become an attractive tool in cybersecurity, for tasks such as fraud detection and identifying malicious activity. Other cybersecurity use cases include malicious PDF detection, malware domain detection, intrusion detection, and the detection of mimicry attacks.
Below, we list the top 10 datasets, in no particular order, that you can use in your next cybersecurity project.
MAWILab
About: MAWILab is a database that helps researchers evaluate their traffic anomaly detection methods. It consists of a set of labels locating traffic anomalies in the MAWI archive. The labels are obtained using an advanced graph-based methodology that compares and combines different, independent anomaly detectors. The dataset is updated daily to include new traffic from upcoming applications and anomalies.
Get the data here.
Malware Training Sets
About: Malware Training Sets is a machine learning dataset that aims to provide useful, classified data to researchers who want to dig deeper into malware analysis using machine learning techniques. It is one of the recommended classified datasets for malware analysis.
Get the data here.
Comprehensive, Multi-Source Cyber-Security Events
About: This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network. In total, the data set is approximately 12 gigabytes compressed across the five data elements and presents 1,648,275,307 events in total for 12,425 users, 17,684 computers, and 62,974 processes. The data sources include Windows-based authentication events from both individual computers and centralised Active Directory domain controller servers.
Get the data here.
Malicious URLs
About: This dataset includes examples of malicious URLs from a large webmail provider, whose live, real-time feed supplies 6,000-7,500 examples of spam and phishing URLs per day. The malicious URLs are extracted from email messages that users manually label as spam, run through pre-filters to extract easily-detected false positives, and then verified manually as malicious. The data set consists of about 2.4 million URLs (examples) and 3.2 million features.
Get the data here.
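With roughly 2.4 million examples and 3.2 million features, a dense matrix of this dataset would not fit in memory on ordinary hardware, so the features are normally handled in sparse form. The sketch below illustrates the arithmetic and one simple sparse layout. It assumes 8-byte floating-point cells and a dictionary-per-example representation; the example vector, the weights and the helper names are hypothetical and do not come from the dataset or its tooling.

```python
# Approximate sizes taken from the dataset description above.
N_URLS = 2_400_000
N_FEATURES = 3_200_000
BYTES_PER_FLOAT64 = 8  # assumption: each cell stored as a 64-bit float


def dense_size_bytes(n_rows, n_cols, bytes_per_cell=BYTES_PER_FLOAT64):
    """Memory needed to store every cell of an n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_cell


def sparse_dot(weights, example):
    """Dot product of two sparse vectors stored as {feature_index: value}."""
    return sum(weights.get(index, 0.0) * value for index, value in example.items())


dense_bytes = dense_size_bytes(N_URLS, N_FEATURES)
print(f"Dense float64 matrix: {dense_bytes / 1e12:.2f} TB")

# Hypothetical example: only the non-zero features of one URL are stored.
example = {3: 1.0, 1_048_576: 0.5, 3_199_999: 1.0}
# Hypothetical linear-model weights, also stored sparsely.
weights = {3: 0.25, 3_199_999: -0.75}

score = sparse_dot(weights, example)
print("Linear score for the example URL:", score)
```

Storing only non-zero entries keeps the cost proportional to the number of features that actually occur in each URL, rather than to the full feature space.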
ADFA Intrusion Detection Datasets
About: The ADFA Intrusion Detection datasets are designed for evaluating system-call-based host intrusion detection systems (HIDS). They include contemporary datasets for Linux and Windows: the ADFA Linux Dataset (ADFA-LD) provides a contemporary Linux dataset for evaluating traditional HIDS, and the ADFA Windows Dataset (ADFA-WD) provides a contemporary Windows dataset for evaluating HIDS.
Get the data here.
Unified Host and Network Data Set
About: The Unified Host and Network Dataset is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. The data is provided in CSV format, with fields such as time, duration, SrcDevice, DstDevice, Protocol, SrcPort, DstPort, SrcPackets, DstPackets, SrcBytes, etc.
Get the data here.
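As a rough illustration of working with flow records in the CSV layout described above, the following sketch totals the bytes sent by each source device. It uses only the ten field names listed in the description; the real files contain further columns that are omitted here. The sample rows are made up for illustration, and the assumption that the files have no header row should be checked against the actual download.

```python
import csv
import io
from collections import Counter

# Field names as listed in the description above. The real files contain
# further columns (the "etc."), which this sketch leaves out.
FIELDS = [
    "time", "duration", "SrcDevice", "DstDevice", "Protocol",
    "SrcPort", "DstPort", "SrcPackets", "DstPackets", "SrcBytes",
]

# Made-up rows for illustration only; these are not taken from the dataset.
SAMPLE = """\
100,5,Comp1,Comp2,6,Port1,443,10,8,1200
101,2,Comp1,Comp3,17,Port2,53,1,1,80
105,9,Comp4,Comp2,6,Port3,443,30,25,9000
"""


def bytes_sent_per_device(lines):
    """Sum SrcBytes per SrcDevice over an iterable of CSV lines."""
    # Assumption: no header row, so the field names are supplied explicitly.
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    totals = Counter()
    for row in reader:
        totals[row["SrcDevice"]] += int(row["SrcBytes"])
    return totals


totals = bytes_sent_per_device(io.StringIO(SAMPLE))
for device, sent in totals.most_common():
    print(device, sent)
```

Because the function accepts any iterable of lines, the same code can read from an open file handle and stream through a large file without loading it into memory.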
User-Computer Authentication Associations in Time
About: User-Computer Authentication Associations in Time is an anonymised dataset that spans nine continuous months and represents 708,304,516 successful authentication events from users to computers, collected from the Los Alamos National Laboratory (LANL) enterprise network. The dataset contains 11,362 users and 22,284 computers, represented as U plus an anonymised, unique number and C plus an anonymised, unique number, respectively.
Get the data here.
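The identifier scheme described above (U plus a number for users, C plus a number for computers) lends itself to simple validation and aggregation. The sketch below counts how many distinct computers each user authenticates to. The comma-separated `time,user,computer` record layout and the sample events are assumptions made for illustration, not a confirmed description of the files.

```python
import re
from collections import defaultdict

# Identifier patterns following the description: "U" or "C" plus a number.
USER_ID = re.compile(r"U\d+")
COMPUTER_ID = re.compile(r"C\d+")

# Hypothetical events in an assumed "time,user,computer" layout.
SAMPLE_EVENTS = ["1,U1,C1", "1,U1,C2", "2,U2,C3", "3,U1,C1"]


def computers_per_user(lines):
    """Return {user: number of distinct computers the user authenticated to}."""
    seen = defaultdict(set)
    for line in lines:
        _time, user, computer = line.strip().split(",")
        if not (USER_ID.fullmatch(user) and COMPUTER_ID.fullmatch(computer)):
            raise ValueError(f"unexpected identifier in record: {line!r}")
        seen[user].add(computer)
    return {user: len(computers) for user, computers in seen.items()}


counts = computers_per_user(SAMPLE_EVENTS)
print(counts)
```

A user whose count of distinct computers suddenly grows is one simple signal that anomaly detection work on this kind of data might look at.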
CTU-13 Dataset
About: CTU-13 is a dataset of botnet traffic that was captured at CTU University, Czech Republic. The goal of this dataset is to provide a large capture of real botnet traffic mixed with normal traffic and background traffic. The CTU-13 dataset consists of thirteen captures, known as scenarios, of different botnet samples.
Get the data here.
Aposemat IoT-23
About: Aposemat IoT-23 is a labelled dataset of malicious and benign network traffic from Internet of Things (IoT) devices. It has 20 malware captures executed on IoT devices and three captures of benign IoT device traffic, for a total of twenty-three captures (called scenarios) of different IoT network traffic.
Get the data here.
EMBER
About: The Endgame Malware BEnchmark for Research (EMBER) dataset is a collection of features from PE files that serves as a benchmark dataset for researchers. It is an open dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign).
Get the data here.