10 Datasets For Data Cleaning Practice For Beginners

To build quality data analytics solutions, it is crucial to wrangle the data first. The process includes identifying and removing inaccurate and irrelevant data, dealing with missing data, removing duplicate data, and so on. This eliminates the major inconsistencies and makes the data easier to work with.
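The steps above can be sketched in a few lines of Python. This is only a minimal illustration that uses the standard library; the sample records, the column names (name, age, city) and the rule that a negative age counts as inaccurate are all assumptions made for the example, not part of any dataset listed below.

```python
import csv
import io

# Hypothetical sample data: one duplicate row, one missing age, one invalid age.
RAW_CSV = """name,age,city
Alice,30,Pune
Bob,,Delhi
Alice,30,Pune
Carol,-5,Mumbai
Dan,41,Chennai
"""


def clean_rows(csv_text):
    """Return rows with duplicates, missing ages and invalid ages removed."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Strip stray whitespace from every field.
        row = {key: value.strip() for key, value in row.items()}
        fingerprint = tuple(row.values())
        if fingerprint in seen:   # drop exact duplicates
            continue
        seen.add(fingerprint)
        if not row["age"]:        # drop rows with a missing age
            continue
        age = int(row["age"])
        if age < 0:               # drop clearly inaccurate values (assumed rule)
            continue
        row["age"] = age          # store the age as a number, not text
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(RAW_CSV)
for row in cleaned:
    print(row)
```

On real datasets, dropping rows is only one option; missing values are often filled in instead, and the right choice depends on the analysis.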

In this article, we list 10 datasets that beginners can use to practise data cleaning and data preprocessing.

(The list is in alphabetical order)

1| Common Crawl Corpus

Common Crawl is a corpus of web crawl data composed of over 25 billion web pages. For all crawls since 2013, the data has been stored in the WARC file format and also contains metadata (WAT) and text data (WET) extracts. The dataset can be used in natural language processing (NLP) projects. 

Get the data here.

2| Google Books Ngrams

Google Books Ngrams is a dataset containing the Google Books n-gram corpora. N-grams are fixed-size tuples of items. In this dataset, the items are words extracted from the Google Books corpus. The size of the dataset is 2.2 TB.
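To make the idea of fixed-size tuples concrete, here is a small sketch that builds word n-grams from a sentence. The helper name `ngrams` and the sample sentence are illustrative assumptions; this is not how the Google Books corpora were produced.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "to be or not to be".split()
bigrams = ngrams(words, 2)   # n = 2 gives word pairs
counts = Counter(bigrams)    # how often each pair occurs

print(bigrams)
print(counts.most_common(1))  # [(('to', 'be'), 2)]
```

Counting n-grams in this way is a common first step in NLP projects, for example to find frequent phrases.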

Get the data here.

3| Hourly Weather Surface – Brazil (Southeast region)

The Hourly Weather Surface – Brazil (Southeast region) dataset covers hourly weather data from 122 weather stations in the southeast region of Brazil. The size of the dataset is 2 GB, and it records 17 climate parameters (continuous values) for each station. The contents include instant air temperature, relative humidity of the air, instant dew point and solar radiation, among others.

Get the data here.

4| Hotel Booking Demand

The Hotel Booking Demand dataset contains booking information for a city hotel and a resort hotel. It includes information such as booking time, length of stay, the number of adults, children and babies, and the number of available parking spaces, among other things. This dataset is ideal for anyone looking to practise exploratory data analysis (EDA) or get started with building predictive models.

Get the data here.

5| Iris Species 

The Iris Species dataset is the Iris Plant Database, which contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two, and the latter are not linearly separable from each other. The columns of this dataset include Id, SepalLength, PetalLength, etc.

Get the data here.

6| New York City Airbnb Open Data

The New York City Airbnb Open Data is a public dataset of Airbnb listings. It includes the information needed to find out more about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions. The dataset describes the listing activity and metrics in NYC, NY, for 2019.

Get the data here.

7| Slogan Dataset

The Slogan dataset can be used to analyse slogans of various organisations. It includes a list of slogans in the form of company_name, company_slogan. The data has been acquired from slogan-list.com, which contains more than 1000 pairs of “company, slogan” spread across 10+ categories.

Get the data here.

8| Taxi Trajectory Data

The Taxi Trajectory dataset provides a complete year (from 01/07/2013 to 30/06/2014) of trajectories for all 442 taxis running in the city of Porto, Portugal. Each ride is categorised into one of three types: taxi central based, stand based and non-taxi central based. Each data sample corresponds to one completed trip and contains a total of nine features.

Get the data here.

9| Temperature Readings: IoT Devices

The Temperature Readings: IoT Devices dataset contains temperature readings from IoT devices installed outside and inside an anonymous room. The size of the data is 7 MB, and it has 5 columns with 97605 rows. The dataset can be used for time-series analysis projects.

Get the data here.

10| Trending YouTube Video Statistics

The Trending YouTube Video Statistics dataset is a daily record of statistics for trending YouTube videos, collected using the YouTube API. It includes several months (and counting) of data on daily trending YouTube videos, with up to 200 listed trending videos per day. Each region’s data is in a separate file. The data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

Get the data here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.