Machine learning techniques play a critical role in detecting serious threats in networks. A good dataset helps build robust machine learning systems that address network security problems such as malware attacks, phishing, and host intrusion. For instance, real-world cybersecurity datasets let you work on projects such as network intrusion detection systems and network packet inspection systems using machine learning models.
Here is a list of the 8 top cybersecurity datasets you can use for your next machine learning project.
(The list is in no particular order)
1| ADFA Intrusion Detection Datasets
About: The ADFA Intrusion Detection Datasets are designed for evaluating system-call-based host intrusion detection systems (HIDS). They cover both Linux and Windows and support anomaly-based intrusion detection on each platform. The datasets serve as a benchmark for traditional HIDS.
Know more here.
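To make the idea of system-call-based anomaly detection concrete, here is a minimal sketch that scores a trace by how many of its system call n-grams never appear in normal traces. It assumes each trace is a whitespace-separated sequence of system call numbers; the traces, the function names, and the choice of n = 3 are all illustrative, not part of the dataset or of any particular HIDS.

```python
from collections import Counter


def parse_trace(text: str) -> list[int]:
    """Parse a whitespace-separated string of system call numbers."""
    return [int(tok) for tok in text.split()]


def ngram_counts(trace: list[int], n: int = 3) -> Counter:
    """Count every contiguous window of n system calls in a trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


def unseen_ratio(trace: list[int], known: set, n: int = 3) -> float:
    """Fraction of a trace's n-grams that never occurred in normal traces."""
    counts = ngram_counts(trace, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unseen = sum(c for gram, c in counts.items() if gram not in known)
    return unseen / total


# Made-up traces for illustration only (not taken from the ADFA data).
normal = parse_trace("6 6 63 6 42 120 6 195 120 6")
suspect = parse_trace("6 6 63 6 11 11 11 195 120 6")

# The "model" is simply the set of n-grams seen in normal behaviour.
known = set(ngram_counts(normal, 3))

print(unseen_ratio(normal, known))   # trace the model was built from
print(unseen_ratio(suspect, known))  # trace with an unfamiliar call pattern
```

A higher ratio means more of the trace looks unfamiliar. In practice, the set of known n-grams would be built from many normal traces, and a threshold on the ratio would be tuned on held-out data.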
2| ISOT Botnet and Ransomware Detection Datasets
About: The ISOT Botnet dataset combines several existing publicly available malicious and non-malicious datasets. The ISOT Ransomware Detection dataset consists of over 420 GB of execution traces from ransomware and benign programmes. The ISOT HTTP Botnet dataset comprises two traffic captures: malicious DNS data for nine different botnets and benign DNS data for 19 well-known software applications.
Know more here.
3| FakeNewsNet
About: FakeNewsNet is a fake news data repository that contains two comprehensive datasets with diverse features covering news content, social context, and spatiotemporal information. The dataset is constructed using an end-to-end system called FakeNewsTracker. The repository can support the study of various open research problems related to fake news.
Know more here.
4| Malicious URLs Dataset
About: The Malicious URLs dataset consists of about 2.4 million URLs (examples) and 3.2 million features. The dataset is available in two formats, Matlab and SVM-light. In Matlab format, the file url.mat contains FeatureTypes, a list of column indices for the data matrices that are real-valued features. In SVM-light format, FeatureTypes is a text file listing the feature indices that correspond to real-valued features.
Know more here.
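For the Malicious URLs dataset in SVM-light format, a small sketch of how one example line and a FeatureTypes list might be read is shown below. It relies only on the general SVM-light convention of `label index:value ...` lines. The sample line, the sample index list, and the assumption that FeatureTypes uses the same index numbering as the data lines are illustrative and should be checked against the actual files.

```python
def parse_svmlight_line(line: str) -> tuple[int, dict[int, float]]:
    """Parse one 'label idx:val idx:val ...' line into (label, sparse features)."""
    tokens = line.split("#", 1)[0].split()  # drop any trailing comment
    label = int(tokens[0])
    features = {}
    for item in tokens[1:]:
        idx, val = item.split(":", 1)
        features[int(idx)] = float(val)
    return label, features


def parse_feature_types(text: str) -> set[int]:
    """Parse a whitespace-separated list of real-valued feature indices."""
    return {int(tok) for tok in text.split()}


def split_by_type(features: dict[int, float], real_valued: set[int]):
    """Separate real-valued features from the remaining features."""
    real = {i: v for i, v in features.items() if i in real_valued}
    other = {i: v for i, v in features.items() if i not in real_valued}
    return real, other


# Made-up inputs for illustration; not copied from the dataset.
sample_line = "+1 4:0.0788 5:0.1242 11:1 155:1"
sample_feature_types = "4\n5\n6\n"

label, features = parse_svmlight_line(sample_line)
real, other = split_by_type(features, parse_feature_types(sample_feature_types))

print(label)          # class label of the example
print(sorted(real))   # indices treated as real-valued
print(sorted(other))  # remaining indices
```

Keeping the features in a sparse dictionary matters here: with millions of possible feature indices, a dense vector per URL would be wasteful. Separating the real-valued indices makes it possible to scale only those columns before training.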
5| ISOT Cloud Intrusion Detection (ISOT CID) Dataset
About: The ISOT Cloud IDS (ISOT CID) dataset consists of over 8 TB of data collected in a real cloud environment. It gathers data from several cloud layers, including guest hosts, hypervisors, and networks, and covers network traffic at the VM and hypervisor levels. The data comes in different formats from multiple sources, including memory dumps, resource (e.g. CPU) utilisation logs, system call traces, system logs, and network traffic.
Know more here.
6| Behavioral Biometric Datasets
About: The ISOT Behavioral Biometric dataset comprises four datasets: a mouse dynamics dataset, a mouse gesture dynamics dataset, a combined mouse/keystroke dynamics/site actions dataset, and a mobile keystroke dynamics OTP dataset. The ISOT Mouse Dynamics dataset consists of mouse dynamics data for 48 users collected over several months. The Mouse Gesture Dynamics dataset consists of genuine gesture data drawn by 41 individuals and forgery data against 25 different individuals.
The Combined Mouse/Keystroke Dynamics/Site Actions dataset consists of mouse, keystroke, and site action (menu) data for 24 different users visiting a website and using it freely (in continuous mode, not static). The dataset includes both genuine samples and attack data, in which some users tried to forge the sessions of actual users. Lastly, the Mobile Keystroke Dynamics OTP dataset consists of mobile keystroke dynamics data collected from about 100 users, each providing both a fixed password and an OTP during login.
Know more here.
7| ISOT Fake News Dataset
About: The ISOT Fake News dataset is a compilation of several thousand fake and truthful news articles, obtained from legitimate news sites and from sites flagged as unreliable by Politifact.com. The dataset contains two types of articles, fake and real news. It was collected from real-world sources; the truthful articles were obtained by crawling Reuters.com.
Know more here.
8| Dynamic Malware Analysis Kernel and User-Level Calls
About: The Dynamic Malware Analysis Kernel and User-Level Calls dataset contains the data collected from Cuckoo and a kernel driver after running 1,000 malicious and 1,000 clean samples. The Kernel Driver folder contains subfolders that hold the API calls from the clean and malicious samples.
Know more here.
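Since the Kernel Driver data is organised into subfolders for clean and malicious samples, one way to turn such a layout into labelled training examples is sketched below. The subfolder names `clean` and `malicious`, the file names, and the helper `label_samples` are hypothetical; the actual names should be taken from the downloaded archive.

```python
import tempfile
from pathlib import Path


def label_samples(root: Path, label_dirs: dict[str, int]) -> list[tuple[Path, int]]:
    """Pair every file under each named subfolder with that subfolder's label."""
    samples = []
    for dirname, label in label_dirs.items():
        for path in sorted((root / dirname).rglob("*")):
            if path.is_file():
                samples.append((path, label))
    return samples


# Hypothetical layout built in a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for dirname, names in {"clean": ["a.txt", "b.txt"], "malicious": ["c.txt"]}.items():
        (root / dirname).mkdir()
        for name in names:
            (root / dirname / name).write_text("placeholder API call log\n")

    # 0 = clean, 1 = malicious (an arbitrary labelling choice).
    samples = label_samples(root, {"clean": 0, "malicious": 1})
    for path, label in samples:
        print(path.name, label)
```

The resulting list of `(path, label)` pairs can then be fed to whatever feature extraction step is applied to the API call logs.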