Machine learning and data science hackathon platforms like Kaggle and MachineHack are testbeds for AI/ML enthusiasts to explore, analyse and share quality data.
However, finding a suitable dataset can be tricky. According to the Kaggle website, there are over 50,000 public datasets and 400,000 public notebooks available, and new datasets are uploaded every day. Each dataset is a small community where you can discuss the data, find relevant public code or create your own projects in Kernels. Sometimes, you can also find notebooks with algorithms that solve the prediction problem for a specific dataset.
Here are some of the most popular datasets on Kaggle.
Credit Card Fraud Detection
This dataset helps companies and teams recognise fraudulent credit card transactions. The dataset contains transactions made by European credit cardholders in September 2013. The dataset presents details of 284,807 transactions, including 492 frauds, that happened over two days.
Recently, the dataset's authors released a simulator for transaction data as part of a practical handbook on machine learning for credit card fraud detection.
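The transaction counts quoted above imply a heavily imbalanced dataset, which matters when choosing an evaluation metric. The short sketch below works through that arithmetic; the function name `accuracy_of_always_legit` is a hypothetical helper introduced here for illustration, not something from the dataset or from Kaggle.

```python
# Figures quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# Share of all transactions that are labelled as fraud.
fraud_rate = fraud_transactions / total_transactions


def accuracy_of_always_legit(total: int, frauds: int) -> float:
    """Accuracy of a trivial model that labels every transaction as legitimate."""
    return (total - frauds) / total


baseline_accuracy = accuracy_of_always_legit(total_transactions, fraud_transactions)

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of an always-legitimate baseline: {baseline_accuracy:.4%}")
```

Because a model that never flags fraud would already score well above 99% accuracy, plain accuracy says little here; measures such as precision and recall on the fraud class are usually more informative.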
European Soccer Database
Billed as the ultimate soccer dataset for data analysis and machine learning, it covers more than 25,000 matches and 10,000 players from the lead championships of 11 European countries across the 2008 to 2016 seasons. Player and team attributes are sourced from EA Sports’ FIFA video game series, including weekly updates. The dataset also includes team line-ups with squad formation (X, Y coordinates), betting odds from up to 10 providers, and detailed match events (goal types, corners, possession, fouls, etc.) for more than 10,000 matches.
Avocado Prices
The dataset shows historical data on avocado prices and sales volume in multiple US markets. The information is sourced from the Hass Avocado Board website. It represents weekly 2018 retail scan data for national retail volume (units) and price, along with region, type (conventional or organic) and avocado volume sold. The same approach can be applied to other fruits and vegetables across geographies.
IBM HR Analytics Employee Attrition & Performance
Created by IBM data scientists, this fictional dataset is used to predict attrition in an organisation. It uncovers various factors that lead to employee attrition and supports explorations such as ‘a breakdown of distance from home by job role and attrition’ or ‘a comparison of average monthly income by education and attrition’.
Red Wine Quality
Red wine quality is a clean and straightforward practice dataset for regression or classification modelling. The two datasets available are related to red and white variants of the Portuguese ‘Vinho Verde’ wine. The information in this dataset includes fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH and others. The dataset is also available on the UCI machine learning repository.
Medical Cost Personal Datasets
This dataset is used for forecasting insurance costs via regression modelling. The dataset includes age, sex, body mass index, children (dependents), smoker, region and charges (individual medical costs billed by health insurance). The dataset is also available on GitHub.
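Several of the columns listed above (sex, smoker, region) are categorical, so they usually need to be turned into numbers before a regression model can use them. The following is a minimal sketch of that step, assuming dictionary rows keyed by the column names from the description; the key `bmi`, the row values and the helper `encode` are made up for illustration and are not taken from the actual dataset.

```python
# Hypothetical rows: column names follow the description above,
# but every value here is invented for illustration.
rows = [
    {"age": 30, "sex": "female", "bmi": 25.0, "children": 0,
     "smoker": "yes", "region": "southwest"},
    {"age": 45, "sex": "male", "bmi": 31.5, "children": 2,
     "smoker": "no", "region": "northeast"},
]


def encode(row, regions):
    """Turn one row into a numeric feature list.

    Numeric columns are kept as floats, binary columns become 0/1 flags,
    and region is one-hot encoded against the given list of regions.
    """
    features = [
        float(row["age"]),
        float(row["bmi"]),
        float(row["children"]),
        1.0 if row["sex"] == "male" else 0.0,
        1.0 if row["smoker"] == "yes" else 0.0,
    ]
    features += [1.0 if row["region"] == r else 0.0 for r in regions]
    return features


# Fix the region order once so every row is encoded consistently.
regions = sorted({row["region"] for row in rows})
encoded = [encode(row, regions) for row in rows]

for vector in encoded:
    print(vector)
```

The encoded vectors could then be passed, together with the charges column as the target, to whichever regression routine you prefer.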
Open Food Facts
This is a free, open, collaborative database of food products worldwide, with ingredients, allergens, nutrition facts and all the tidbits of information found on product labels. The project was part of Google Summer of Code 2018. More than 5,000 contributors have added over 600,000 products from 150 countries, using an app or their camera to scan barcodes and upload pictures of products and their labels.
Machine Learning & Data Science Survey
Kaggle conducted an industry-wide survey in 2017 to establish a comprehensive overview of the data science and machine learning landscape. The survey received over 16,000 responses, gathering information on data science, machine learning innovation, how to become a data scientist and more. The kernels used in the report are available on Kaggle.
Titanic
The Titanic dataset consists of the original data from the Titanic competition and is ideal for binary logistic regression. The dataset contains information such as each passenger’s ID, age, sex and fare. In the Titanic competition, users create a machine learning model that predicts which passengers survived the Titanic shipwreck.
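To make the idea of binary logistic regression concrete, here is a minimal sketch written with the Python standard library only. It is fitted on a handful of invented rows, not on the real Titanic data, and the two features (`is_female`, fare in hundreds) as well as the learning rate and epoch count are assumptions chosen for illustration.

```python
import math

# Hypothetical toy rows, NOT taken from the real Titanic data:
# (is_female, fare_in_hundreds) -> survived (1) or not (0)
rows = [
    ((1.0, 0.80), 1),
    ((1.0, 0.30), 1),
    ((0.0, 0.07), 0),
    ((0.0, 0.08), 0),
    ((1.0, 0.10), 1),
    ((0.0, 0.50), 0),
]


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Predicted probability of survival for one passenger."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit(rows, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in rows:
            error = predict_proba(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


weights, bias = fit(rows)

# Classify as "survived" when the predicted probability is at least 0.5.
predictions = [int(predict_proba(weights, bias, f) >= 0.5) for f, _ in rows]
print("Learned weights:", weights, "bias:", bias)
print("Predictions:", predictions)
```

In practice you would load the competition's training file and use an established library rather than a hand-written loop; the sketch only shows what the model computes.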
Annotated Corpus for Named Entity Recognition
This dataset is extracted from the GMB (Groningen Meaning Bank) corpus. It is tagged, annotated and built specifically to train a classifier to predict labelled entities such as names and locations. It gives you a broad view of feature engineering and helps solve business problems such as picking out entities from electronic medical records.
Check out other popular datasets on Kaggle here.