
10 Most Popular Datasets On Kaggle


Machine learning and data science hackathon platforms like Kaggle and MachineHack are testbeds for AI/ML enthusiasts to explore, analyse and share quality data. 

However, finding a suitable dataset can be tricky. According to the Kaggle website, there are over 50,000 public datasets and 400,000 public notebooks available, and new datasets are uploaded every day. Each dataset is a small community where you can discuss the data, find relevant public code or create your own projects in Kernels. Sometimes, you can also find notebooks with algorithms that solve the prediction problem for a specific dataset.

Here are some of the most popular datasets on Kaggle.

Credit Card Fraud Detection

This dataset helps companies and teams recognise fraudulent credit card transactions. The dataset contains transactions made by European credit cardholders in September 2013. The dataset presents details of 284,807 transactions, including 492 frauds, that happened over two days.

Recently, the dataset's maintainers released a simulator for transaction data as part of the practical handbook on machine learning for credit card fraud detection.
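The figures quoted above imply a heavily imbalanced classification problem. The short Python check below uses only the two counts stated in the description (284,807 transactions, 492 frauds) to show how small the fraud class is. The variable names are illustrative and not part of the dataset.

```python
# Counts taken from the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# Share of all transactions that are labelled as fraud.
fraud_rate = fraud_transactions / total_transactions

# A trivial model that labels every transaction as legitimate
# is still correct on every non-fraud row.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'not fraud': {baseline_accuracy:.4%}")
```

Fraud accounts for roughly 0.17% of the rows, so a model can score very high accuracy while catching no fraud at all. Metrics such as precision and recall are usually more informative for data like this.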

European Soccer Database

Billed as the ultimate soccer dataset for data analysis and machine learning, it covers 25,000+ matches and 10,000+ players from the lead championships of 11 European countries over the 2008 to 2016 seasons. It also includes player and team attributes sourced from EA Sports’ FIFA video game series (with weekly updates), team line-ups with squad formation (X, Y coordinates), betting odds from up to 10 providers, and detailed match events (goal types, corners, possession, fouls, etc.) for 10,000+ matches.

Avocado Prices

The dataset shows historical data on avocado prices and sales volume in multiple US markets. The information has been sourced from the Hass Avocado Board website. It represents weekly 2018 retail scan data for national retail volume (units) and price, along with region, type (conventional or organic) and the volume of avocados sold. The same kind of analysis can be applied to other fruits and vegetables across geographies.

IBM HR Analytics Employee Attrition & Performance

Created by IBM data scientists, this fictional dataset is used to predict attrition in an organisation. It uncovers various factors that lead to employee attrition and lets you explore questions such as ‘a breakdown of distance from home by job role and attrition’ or ‘a comparison of average monthly income by education and attrition’.
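A comparison like "average monthly income by education and attrition" is a grouped mean. The sketch below shows one way to compute it with the Python standard library. The column names `Education`, `Attrition` and `MonthlyIncome` are assumptions about the CSV header, and the sample rows are made up for illustration. They are not values from the dataset.

```python
from collections import defaultdict

def mean_by_groups(rows, group_keys, value_key):
    """Return {group tuple: mean of value_key} for a list of dict rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = tuple(row[key] for key in group_keys)
        totals[group] += float(row[value_key])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows for illustration only; NOT taken from the IBM dataset.
# Column names are assumed, not confirmed by the article.
sample_rows = [
    {"Education": 1, "Attrition": "Yes", "MonthlyIncome": 2000},
    {"Education": 1, "Attrition": "Yes", "MonthlyIncome": 3000},
    {"Education": 1, "Attrition": "No", "MonthlyIncome": 5000},
    {"Education": 3, "Attrition": "No", "MonthlyIncome": 7000},
]

averages = mean_by_groups(sample_rows, ["Education", "Attrition"], "MonthlyIncome")
for group, average in sorted(averages.items()):
    print(group, average)
```

With the real file, the rows could come from `csv.DictReader`, provided the header matches the assumed column names.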

Red Wine Quality 

Red wine quality is a clean and straightforward practice dataset for regression or classification modelling. The two datasets available are related to red and white variants of the Portuguese ‘Vinho Verde’ wine. The information in this dataset includes fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH and others. The dataset is also available on the UCI machine learning repository.

Medical Cost Personal Datasets

This dataset is used for forecasting insurance costs via regression modelling. The dataset includes age, sex, body mass index, children (dependents), smoker, region and charges (individual medical costs billed by health insurance). The dataset is also available on GitHub.

Open Food Facts 

This is a free, open, collaborative database of food products worldwide, with ingredients, allergens, nutrition facts and all the tidbits of information found on product labels. The database is a part of Google’s Summer of Code 2018. 5000+ contributors have added 600K+ products from 150 countries using an app or their camera to scan barcodes and upload pictures of products and their labels. 

Machine Learning & Data Science Survey

Kaggle conducted an industry-wide survey in 2017 to establish a comprehensive overview of the data science and machine learning landscape. The survey received over 16K responses, gathering information on data science, machine learning innovation, how to become a data scientist and more. You can find the kernels used in the report here.

Titanic

The Titanic dataset consists of the original data from the Titanic competition and is ideal for binary logistic regression. It contains information about each passenger, such as ID, age, sex and fare. The competition asks users to build a machine learning model that predicts which passengers survived the Titanic shipwreck.
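To make "binary logistic regression" concrete, here is a minimal sketch of the technique, fitted with plain batch gradient descent in pure Python. It is not the competition's reference solution. The two features (`is_female`, and fare divided by 100) and the six toy rows are made up for illustration. They are not real passenger records.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability of the positive class (survived)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def log_loss(weights, bias, X, y):
    """Mean binary cross-entropy over the rows."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for features, label in zip(X, y):
        p = predict_proba(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)

def fit(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias with batch gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias

# Hypothetical toy rows, NOT real passengers: [is_female, fare / 100]
X = [[1, 0.8], [1, 0.3], [0, 0.1], [0, 0.2], [1, 0.5], [0, 0.7]]
y = [1, 1, 0, 0, 1, 0]  # 1 = survived, 0 = did not survive

weights, bias = fit(X, y)
print("Training loss:", log_loss(weights, bias, X, y))
print("P(survived) for [1, 0.8]:", predict_proba(weights, bias, [1, 0.8]))
```

In practice, most notebooks would use a library implementation and the full set of passenger columns. The hand-written loop is only meant to show what the model computes.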

Annotated Corpus for Named Entity Recognition

This dataset is extracted from the GMB (Groningen Meaning Bank) corpus, which is tagged, annotated and built specifically to train a classifier to predict labelled entities such as names, locations, etc. It gives you a broad view of feature engineering and helps solve business problems like picking entities out of electronic medical records.
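Corpora of this kind typically label each token with an IOB tag: `B-` opens an entity, `I-` continues it, and `O` marks everything else. The sketch below assumes that scheme, groups tagged tokens back into whole entities, and uses a made-up sentence with illustrative tag names (`per`, `geo`, `tim`). The exact tag set in the file should be checked before relying on it.

```python
def extract_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        continues_entity = tag.startswith("I-") and tag[2:] == current_type
        if continues_entity:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if tag.startswith("B-"):
            current_tokens, current_type = [token], tag[2:]
        else:
            # "O" tags and stray "I-" tags with no matching opener are skipped.
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Made-up example sentence; tag names are assumptions for illustration.
tokens = ["Ada", "Lovelace", "visited", "London", "in", "1843"]
tags = ["B-per", "I-per", "O", "B-geo", "O", "B-tim"]

print(extract_entities(tokens, tags))
```

A trained classifier would produce the `tags` list. This helper only shows how per-token predictions turn into the entities a downstream application needs.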

Check out other popular datasets on Kaggle here


Amit Raja Naik

Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.