10 Most Popular Datasets On Kaggle

Machine learning and data science hackathon platforms like Kaggle and MachineHack are testbeds for AI/ML enthusiasts to explore, analyse and share quality data. 

However, finding a suitable dataset can be tricky. According to the Kaggle website, there are over 50,000 public datasets and 400,000 public notebooks available, and new datasets are uploaded every day. Each dataset is a small community where you can discuss the data, find relevant public code or create your own projects in Kernels. Sometimes, you can also find notebooks with algorithms that solve the prediction problem for a specific dataset.

Here are some of the most popular datasets on Kaggle.

Credit Card Fraud Detection

This dataset helps companies and teams recognise fraudulent credit card transactions. The dataset contains transactions made by European credit cardholders in September 2013. The dataset presents details of 284,807 transactions, including 492 frauds, that happened over two days.

A simulator for transaction data was also released recently as part of the practical handbook on machine learning for credit card fraud detection.
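The two counts quoted above imply a heavily imbalanced problem: only about 0.17% of the transactions are fraudulent. As a rough sketch in plain Python, using nothing but those two counts, the snippet below computes the fraud rate and shows why raw accuracy can mislead here. The "always legitimate" baseline is a hypothetical model introduced only for illustration.

```python
# Counts quoted in the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def fraud_rate(total: int, frauds: int) -> float:
    """Return the fraction of transactions that are fraudulent."""
    return frauds / total


def always_legit_baseline(total: int, frauds: int) -> dict:
    """Metrics for a hypothetical model that never flags any transaction."""
    true_negatives = total - frauds  # every legitimate transaction is "correct"
    return {
        "accuracy": true_negatives / total,
        "recall": 0.0,  # no fraudulent transaction is ever caught
    }


rate = fraud_rate(TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS)
baseline = always_legit_baseline(TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS)

print(f"Fraud rate: {rate:.3%}")
print(f"Baseline accuracy: {baseline['accuracy']:.3%}")
print(f"Baseline fraud recall: {baseline['recall']:.0%}")
```

A model that catches no fraud at all still scores very high accuracy on this data, which is why metrics such as recall or precision are usually more informative for this kind of dataset.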


European Soccer Database

Billed as the ultimate soccer dataset for data analysis and machine learning, it covers 25,000+ matches and 10,000+ players from 11 European countries and their lead championships across the 2008 to 2016 seasons. Player and team attributes are sourced from EA Sports’ FIFA video game series, including weekly updates. The dataset also includes team line-ups with squad formation (X, Y coordinates), betting odds from up to 10 providers, and detailed match events (goal types, corners, possession, fouls, etc.) for 10,000+ matches.

Avocado Prices

The dataset contains historical data on avocado prices and sales volume in multiple US markets, sourced from the Hass Avocado Board website. It represents weekly 2018 retail scan data for national retail volume (units) and price, along with region, type (conventional or organic) and avocado volume sold. The same kind of analysis can be applied to other fruits and vegetables across geographies.

IBM HR Analytics Employee Attrition & Performance

Created by IBM data scientists, this fictional dataset is used to predict attrition in an organisation. It uncovers various factors that lead to employee attrition and supports explorations such as “a breakdown of distance from home by job role and attrition” or “a comparison of average monthly income by education and attrition”.

Red Wine Quality 

Red wine quality is a clean and straightforward practice dataset for regression or classification modelling. The two datasets available are related to red and white variants of the Portuguese ‘Vinho Verde’ wine. The information in this dataset includes fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH and others. The dataset is also available on the UCI machine learning repository.

Medical Cost Personal Datasets

This dataset is used for forecasting insurance costs via regression modelling. It includes age, sex, body mass index, number of children (dependents), smoker status, region and charges (individual medical costs billed by health insurance). The dataset is also available on GitHub.
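To illustrate what regression modelling means in this setting, here is a minimal sketch that fits a straight line of charges against age using ordinary least squares in plain Python. The (age, charges) pairs are made up for illustration and are not rows from the dataset. They are constructed to lie exactly on a line so the result is easy to verify. A realistic model would use the other columns as well.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (age, charges) pairs, NOT taken from the real dataset.
# By construction they satisfy charges = 250 * age + 1000.
ages = [20, 30, 40, 50, 60]
charges = [250 * age + 1000 for age in ages]

slope, intercept = fit_simple_ols(ages, charges)
predicted_45 = slope * 45 + intercept  # predicted charges for a 45-year-old

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at 45={predicted_45:.2f}")
```

Because the toy data is perfectly linear, the fit recovers the slope and intercept used to generate it. Real charges are far noisier and depend strongly on features such as smoker status.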

Open Food Facts 

This is a free, open, collaborative database of food products from around the world, with ingredients, allergens, nutrition facts and all the tidbits of information found on product labels. The project was part of Google Summer of Code 2018. More than 5,000 contributors have added 600K+ products from 150 countries, using an app or their camera to scan barcodes and upload pictures of products and their labels.

Machine Learning & Data Science Survey

Kaggle conducted an industry-wide survey in 2017 to establish a comprehensive overview of the data science and machine learning landscape. The survey received over 16K responses, gathering information on data science, machine learning innovation, how to become a data scientist and more. The kernels used in the report are available on Kaggle.

Titanic

The Titanic dataset consists of original data from the Titanic competition and is ideal for binary logistic regression. It contains information such as each passenger’s ID, age, sex and fare. In the Titanic competition, users create a machine learning model that predicts which passengers survived the Titanic shipwreck.
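As a rough illustration of binary logistic regression, the sketch below fits a tiny model with batch gradient descent in plain Python. It is a from-scratch toy, not the approach used by any particular competition entry. The six "passengers" are hypothetical rows invented for the example, encoded with two features loosely modelled on the sex and fare fields. The learning rate and epoch count are arbitrary choices.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row: predicted probability minus label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        # Step against the average gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (survived) if the predicted probability is at least 0.5, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical passengers, NOT real rows: [is_female, fare / 100].
rows = [[1, 0.70], [1, 0.10], [0, 0.08], [0, 0.50], [1, 0.30], [0, 0.07]]
labels = [1, 1, 0, 0, 1, 0]  # 1 = survived, 0 = did not survive

weights, bias = fit_logistic(rows, labels)
predictions = [predict(weights, bias, x) for x in rows]
print(predictions)
```

In practice a library implementation with a proper train/test split would be the usual choice. The point here is only to show the shape of the problem: numeric features in, a survival probability out.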

Annotated Corpus for Named Entity Recognition

This dataset is extracted from the Groningen Meaning Bank (GMB) corpus and is tagged, annotated and built specifically to train classifiers that predict labelled entities such as names and locations. It gives a broad view of feature engineering and helps solve business problems such as picking out entities from electronic medical records.
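To make "labelled entities" concrete, the sketch below groups token-level tags into entity spans. It assumes the tags follow the common IOB scheme (`B-` starts an entity, `I-` continues it, `O` is outside any entity) with type suffixes such as `per` and `geo`. The sentence and its tags are made up for illustration, and `extract_entities` is a hypothetical helper, not part of the dataset or any library.

```python
def extract_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the currently open entity.
            current_words.append(token)
        else:
            # "O" tag or a stray "I-" tag: close any open entity, skip the token.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Made-up example sentence with assumed IOB tags.
tokens = ["Ada", "Lovelace", "lived", "in", "London", "."]
tags = ["B-per", "I-per", "O", "O", "B-geo", "O"]

entities = extract_entities(tokens, tags)
print(entities)
```

A trained classifier would predict the tag for each token. A post-processing step like this one then turns those per-token predictions into usable entities.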

Check out other popular datasets on Kaggle.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
