10 Most Popular Datasets On Kaggle

  • Kaggle has over 50,000 public datasets and 400,000 public notebooks.

Machine learning and data science hackathon platforms like Kaggle and MachineHack are testbeds for AI/ML enthusiasts to explore, analyse and share quality data. 

However, finding a suitable dataset can be tricky. According to the Kaggle website, there are over 50,000 public datasets and 400,000 public notebooks available, and new datasets are uploaded every day. Each dataset is a small community where you can discuss the data, find relevant public code or create your own projects in Kernels. Sometimes, you can also find notebooks with algorithms that solve the prediction problem for a specific dataset.

Here are some of the most popular datasets on Kaggle.

Credit Card Fraud Detection

This dataset helps companies and teams recognise fraudulent credit card transactions. The dataset contains transactions made by European credit cardholders in September 2013. The dataset presents details of 284,807 transactions, including 492 frauds, that happened over two days.
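The counts above imply a heavily imbalanced classification problem. As a rough illustration (a minimal sketch that uses only the two figures quoted in this article, with hypothetical function names), the fraud share and the accuracy of a naive "never fraud" baseline can be worked out as follows:

```python
# Counts quoted in the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def fraud_rate(frauds: int, total: int) -> float:
    """Share of transactions that are labelled as fraud."""
    return frauds / total


def always_legit_accuracy(frauds: int, total: int) -> float:
    """Accuracy of a baseline that labels every transaction as legitimate."""
    return (total - frauds) / total


rate = fraud_rate(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
baseline = always_legit_accuracy(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)

print(f"Fraud share: {rate:.4%}")
print(f"Accuracy of the 'never fraud' baseline: {baseline:.4%}")
```

Because the baseline scores well on accuracy without catching a single fraud, metrics that focus on the minority class, such as precision and recall, are usually more informative for this dataset.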

The dataset's maintainers recently released a simulator for transaction data as part of a practical handbook on machine learning for credit card fraud detection.

European Soccer Database

It is described as the ultimate soccer dataset for data analysis and machine learning. The dataset covers 25,000+ matches and 10,000+ players across 11 European countries and their lead championships, for seasons 2008 to 2016. It includes player and team attributes sourced from EA Sports’ FIFA video game series (including weekly updates), team line-ups with squad formation (X, Y coordinates), betting odds from up to 10 providers, and detailed match events (goal types, corners, possession, fouls, etc.) for 10,000+ matches.

Avocado Prices

The dataset shows historical data on avocado prices and sales volume in multiple US markets. The information is sourced from the Hass Avocado Board website. It represents weekly 2018 retail scan data for national retail volume (units and price), along with region, type (conventional or organic) and avocado sold volume. The same kind of analysis can be applied to other fruits and vegetables across geographies.

IBM HR Analytics Employee Attrition & Performance

Created by IBM data scientists, this fictional dataset is used to predict attrition in an organisation. It helps uncover the factors that lead to employee attrition and explore questions such as ‘a breakdown of distance from home by job role and attrition’ or ‘a comparison of average monthly income by education and attrition’.

Red Wine Quality 

Red wine quality is a clean and straightforward practice dataset for regression or classification modelling. The two datasets available are related to red and white variants of the Portuguese ‘Vinho Verde’ wine. The information in this dataset includes fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH and others. The dataset is also available on the UCI machine learning repository.

Medical Cost Personal Datasets

This dataset is used for forecasting insurance costs via regression modelling. The dataset includes age, sex, body mass index, children (dependents), smoker, region and charges (individual medical costs billed by health insurance). The dataset is also available on GitHub.
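Several of these fields are categorical, so they need to be turned into numbers before a regression model can use them. The sketch below shows one common way to do that (binary flags plus one-hot encoding). The key names, the "male"/"yes" values and the four region labels are assumptions made for illustration and should be checked against the actual file; the example record is made up and not taken from the dataset.

```python
# Assumed region labels; verify them against the actual file before use.
REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def encode_record(record: dict) -> list[float]:
    """Turn one record into a numeric feature vector for regression."""
    features = [
        float(record["age"]),
        float(record["bmi"]),
        float(record["children"]),
        1.0 if record["sex"] == "male" else 0.0,    # binary flag
        1.0 if record["smoker"] == "yes" else 0.0,  # binary flag
    ]
    # One-hot encode the region, using the first label as the baseline
    # so that the encoded columns are not perfectly collinear.
    features += [1.0 if record["region"] == r else 0.0 for r in REGIONS[1:]]
    return features


# Made-up record, for illustration only.
example = {
    "age": 40, "sex": "female", "bmi": 27.5,
    "children": 2, "smoker": "no", "region": "southeast",
}
encoded = encode_record(example)
print(encoded)
```

The charges column is left out of the feature vector because it is the target the regression model would be trained to predict.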

Open Food Facts 

This is a free, open, collaborative database of food products worldwide, with ingredients, allergens, nutrition facts and all the tidbits of information found on product labels. The project was part of Google Summer of Code 2018. More than 5,000 contributors have added 600K+ products from 150 countries, using an app or their camera to scan barcodes and upload pictures of products and their labels.

Machine Learning & Data Science Survey

Kaggle conducted an industry-wide survey in 2017 to establish a comprehensive overview of the data science and machine learning landscape. The survey received over 16K responses, gathering information on data science, machine learning innovation, how to become a data scientist and more. The kernels used in the report are available on Kaggle.

Titanic

The Titanic dataset consists of the original data from the Titanic competition and is ideal for binary logistic regression. The dataset contains information such as each passenger’s ID, age, sex and fare. The Titanic competition involves users creating a machine learning model that predicts which passengers survived the Titanic shipwreck.
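To give a feel for what binary logistic regression does here, the following is a minimal sketch in plain Python that fits a model with batch gradient descent. The six rows are made-up toy values, not records from the dataset, and the two features (a female flag and a fare scaled to the 0 to 1 range) are assumptions chosen for illustration. In practice, a library implementation would normally be used instead.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Predicted probability of survival for one feature vector."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, weights, bias) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy rows (hypothetical): [is_female, scaled_fare]; label 1 means survived.
X = [[1, 0.8], [1, 0.3], [0, 0.2], [0, 0.1], [1, 0.5], [0, 0.4]]
y = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic(X, y)
predictions = [1 if predict_proba(x, weights, bias) > 0.5 else 0 for x in X]
print(predictions)
```

The model outputs a probability for each passenger, and thresholding it at 0.5 turns that probability into the survived or did-not-survive prediction the competition asks for.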

Annotated Corpus for Named Entity Recognition

This dataset is extracted from the GMB (Groningen Meaning Bank) corpus. It is tagged, annotated and built specifically to train a classifier to predict labelled entities such as names and locations. It gives you a broad view of feature engineering and helps solve business problems such as picking out entities from electronic medical records.

Other popular datasets are available on Kaggle.

Copyright Analytics India Magazine Pvt Ltd
