To build a good model, you need a large amount of data. But finding the right dataset for your machine learning or data science project can be quite a challenging task. Many organizations, researchers, and individuals have shared their work, and we can use their datasets to build our projects.
So in this article, we are going to discuss 20+ machine learning and data science datasets and project ideas that you can use for practicing and upgrading your skills.
1. Enron Email Dataset
The Enron dataset is well known in natural language processing. It has more than 500K emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.
a. Data Link: Enron email dataset
b. Project Idea: Using k-means clustering, you can build a model to detect fraudulent activities. K-means clustering is an unsupervised machine learning algorithm. It separates the observations into k clusters based on similar patterns in the data.
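The two steps of k-means (assign each observation to the nearest centroid, then move each centroid to the mean of its observations) can be sketched in plain Python. This is a minimal illustration on made-up two-dimensional points, not on the Enron emails; in a real project each email would first have to be converted into a numeric feature vector, and that step is assumed here. The function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, labels)."""
    # Deterministic start for reproducibility; real implementations
    # usually pick the initial centroids at random.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Made-up feature vectors forming two loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Observations that end up far from every centroid, or in a very small cluster, are the ones you would inspect first when looking for unusual activity.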
2. Chatbot Intents Dataset
The Chatbot dataset is a JSON file that has distinct tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user might type, and the chatbot responds according to the pattern it matches. The dataset is perfect for understanding how chatbot data works.
a. Data Link: Intents JSON Dataset
b. Project Idea: You can build a chatbot, or understand how a chatbot works, by modifying and expanding the data with your own observations. To build a chatbot of your own, you need a good knowledge of natural language processing concepts.
c. Source Code: Chatbot Project in Python
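How tags and patterns drive a reply can be sketched as follows. The JSON layout used here (an `intents` list whose entries have `tag`, `patterns`, and `responses` keys) and the sample entries are assumptions for illustration, so check the actual file before relying on them. The matcher simply picks the tag whose patterns share the most words with the message, which is far simpler than the NLP model a real chatbot would use.

```python
import json

# Hypothetical file content; the key names are assumed, not taken from the dataset.
INTENTS_JSON = """
{"intents": [
  {"tag": "greetings", "patterns": ["hello", "hi there", "good morning"],
   "responses": ["Hello! How can I help?"]},
  {"tag": "goodbye", "patterns": ["bye", "see you later"],
   "responses": ["Goodbye!"]},
  {"tag": "pharmacy_search", "patterns": ["find a pharmacy", "pharmacy near me"],
   "responses": ["Searching for pharmacies..."]}
]}
"""

def tokenize(text):
    """Lowercase a sentence and split it into a set of words."""
    return set(text.lower().split())

def predict_tag(message, intents):
    """Return the tag whose patterns overlap most with the message, or None."""
    words = tokenize(message)
    best_tag, best_score = None, 0
    for intent in intents:
        score = max(len(words & tokenize(p)) for p in intent["patterns"])
        if score > best_score:
            best_tag, best_score = intent["tag"], score
    return best_tag

intents = json.loads(INTENTS_JSON)["intents"]
print(predict_tag("Is there a pharmacy near my house", intents))
```

Adding your own patterns to a tag and watching how the predictions change is a quick way to see why the variety of patterns matters.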
3. Flickr 30k Dataset
The Flickr 30k dataset has over 30,000 images, and each image has several captions. This dataset is useful for building image caption generators. It is a larger successor to Flickr 8k and can be used to create more accurate models.
a. Data Link: Flickr image dataset
b. Project Idea: You can build a model that uses a CNN, which is well suited to analyzing and extracting features from an image, and then generates an English sentence describing the image. That sentence is called a caption.
4. Parkinson Dataset
Parkinson’s disease is a nervous system disorder that affects movement. The Parkinson dataset contains biomedical measurements: 195 records with 23 different attributes. This data is useful in differentiating healthy people from people with Parkinson’s disease.
- Data Link: Parkinson dataset
- Project Idea: You can build a model that can separate healthy people from people having Parkinson’s disease. The XGBoost algorithm is useful for this purpose. It stands for extreme gradient boosting, and it builds an ensemble of decision trees.
- Source Code: ML Project on Detecting Parkinson’s Disease
5. Iris Dataset
The Iris dataset is a beginner-friendly dataset that has information about the petal and sepal sizes of three species of the iris flower. It is also known as Fisher’s iris dataset after Ronald Fisher used it in his paper in 1936. It contains 150 rows with four feature columns, which makes it a handy first dataset for machine learning.
- Data Link: Iris dataset
- Project Idea: Classification is the task of separating items into their corresponding classes. You can implement a machine learning classification or regression model on the dataset. This dataset is also instrumental in learning the differences between supervised and unsupervised learning.
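As a concrete instance of classification, here is a minimal k-nearest-neighbours sketch. The six rows are hand-typed illustrative measurements in the order sepal length, sepal width, petal length, petal width; a real project would load all 150 rows from the dataset. K-nearest neighbours is just one possible classifier, chosen here because it fits in a few lines.

```python
import math
from collections import Counter

# Illustrative rows: (sepal length, sepal width, petal length, petal width), species.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def knn_predict(sample, train, k=3):
    """Predict the majority species among the k closest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(sample, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((5.0, 3.4, 1.5, 0.2), TRAIN))
```

Because the species labels are used during training, this is supervised learning; dropping the labels and clustering the same measurements would be the unsupervised counterpart.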
6. ImageNet dataset
ImageNet is an extensive image database primarily used for object recognition research. It is organized according to the WordNet hierarchy. It has over 20,000 categories and hundreds of images per category. Its size exceeds 150 GB. The ImageNet project maintains the database and hosts a challenging competition named ILSVRC for people to build more and more accurate object-recognition and image-classification models.
- Data Link: ImageNet dataset
- Project Idea: Implement image classification on this vast database to recognize objects. A CNN (convolutional neural network) is a deep learning model that is highly useful for getting accurate results in this project.
7. Mall Customers Dataset
The Mall customers dataset holds the details of people visiting the mall. The dataset has customer ID, gender, age, annual income, and spending score. You can gain insights from the data and divide the customers into different groups based on their behavior.
- Data Link: Mall customers dataset
- Project Idea: Segment the customers based on their gender, age, and interests. This is useful in customized marketing. Customer segmentation is a technique that divides customers into groups of similar individuals.
- Source Code: Customer segmentation with Machine learning.
8. Google Trends Data Portal
Google provides the Google Trends service, which makes search data and trends available to everyone. This data can help in examining and analyzing popular searches and trending topics in specific areas and all over the world. You can also download the data as CSV files for free.
- Data Link: Google trends datasets
9. The Boston Housing Dataset
The Boston Housing Dataset is one of the most popular datasets used for pattern recognition. The dataset contains information about houses in Boston like crime rate, tax, number of rooms, etc. It has 506 observations of 14 different variables. With this dataset, you can predict house prices.
- Data Link: Boston dataset
- Project Idea: Build a model to predict the price of a new house using regression. Regression techniques can determine relationships between different variables and can also predict the values of variables based on these relationships.
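The regression idea can be sketched with a one-variable least-squares fit. The numbers below are made up (number of rooms against price) and are not taken from the Boston data; a real model would use several of the 14 variables at once.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example: average rooms per house vs. price (in thousands).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_line(rooms, price)
predicted = slope * 6.5 + intercept  # estimate for a 6.5-room house
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The slope is the relationship the model has learned (how much the price changes per extra room), and plugging in a new value gives the prediction.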
10. Uber Pickups Dataset
This dataset has information on around 4.5 million Uber rides in New York City from April 2014 to September 2014 and about 14 million more from January 2015 to June 2015. It contains detailed location information associated with every ride recorded. This dataset is very useful for density analysis and pattern recognition.
- Data Link: Uber pickups dataset
- Project Idea: Analyze and visualize the customer ride data. Visualization lets you get information from the data quickly and gather insights that can help improve business intelligence.
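Before plotting anything, the rides usually need to be aggregated. The sketch below counts pickups per hour of the day and prints a crude text histogram. The column name `Date/Time` and the `month/day/year hour:minute:second` timestamp format are assumptions about the CSV layout, and the three rows are made up.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up rows in an assumed layout; in practice, open your downloaded CSV file instead.
SAMPLE_CSV = """Date/Time,Lat,Lon
4/1/2014 0:11:00,40.7690,-73.9549
4/1/2014 0:17:00,40.7267,-74.0345
4/1/2014 17:45:00,40.7316,-73.9873
"""

def pickups_per_hour(csv_file):
    """Count rides for each hour of the day (0-23)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        when = datetime.strptime(row["Date/Time"], "%m/%d/%Y %H:%M:%S")
        counts[when.hour] += 1
    return counts

counts = pickups_per_hour(io.StringIO(SAMPLE_CSV))
for hour in sorted(counts):
    print(f"{hour:02d}:00  {'#' * counts[hour]}")  # crude text histogram
```

The same counts can be handed to any plotting library to draw a bar chart of demand across the day.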
11. Recommender Systems Dataset
The Recommender Systems Dataset is a portal to a collection of rich datasets used in lab research projects at UCSD. These datasets, drawn from popular sources such as Goodreads book reviews, Amazon product reviews, bartending data, and social media, are used in building recommendation systems.
- Data Link: Recommender systems dataset
- Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc. to users based on their interests and the things they have liked and used earlier.
- Source Code: Movie Recommendation System Project
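One simple way to suggest items based on what a user liked earlier is user-based collaborative filtering: find users with similar tastes and recommend what they rated highly. The ratings below are made up and the function names are illustrative; the UCSD datasets would first need to be parsed into this shape.

```python
import math

# Made-up ratings: user -> {item: rating from 1 to 5}.
RATINGS = {
    "ana":  {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":  {"book_a": 4, "book_b": 5, "book_d": 5},
    "cara": {"book_c": 5, "book_e": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user, ratings, top_n=1):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ana", RATINGS))
```

Here "ana" agrees with "ben" on two books, so the item "ben" liked and "ana" has not seen ranks first.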
12. UCI Spambase Dataset
A major problem that every email and messaging service continuously works on is classifying emails as spam or non-spam. The UCI Spambase dataset contains 4601 emails, each described by 57 attributes. This information can help build models to filter out the spam.
- Data Link: UCI spam base dataset
- Project Idea: You can build a model that can identify your emails as spam or non-spam. The real challenge with this project is avoiding classifying non-spam emails as spam.
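How often a model flags legitimate mail can be tracked with precision and the false-positive rate. The labels and predictions below are made up for illustration; `1` means spam and `0` means non-spam.

```python
def spam_metrics(y_true, y_pred):
    """Return (precision, false_positive_rate) for binary labels, 1 = spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good mail flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good mail passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return precision, false_positive_rate

# Made-up labels for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, fpr = spam_metrics(y_true, y_pred)
print(f"precision={precision:.2f}, false positive rate={fpr:.2f}")
```

For a spam filter you would typically tune the model to keep the false-positive rate low, even if that means letting a little more spam through.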
13. GTSRB (German traffic sign recognition benchmark) Dataset
The GTSRB dataset contains images of traffic signs belonging to 43 different classes. It contains around 50,000 images and information on the bounding box of each sign. The dataset is useful for multiclass classification.
- Data Link: GTSRB dataset
- Project Idea: Build a model using a deep learning framework that classifies traffic signs. Traffic sign classification is useful in creating self-driving vehicles.
- Source Code: Traffic Signs Recognition Python Project
14. Cityscapes Dataset
The Cityscapes dataset is a dataset for computer vision projects. It is open-source and contains high-quality pixel-level annotations of video sequences recorded in the streets of 50 different cities. The dataset is useful in training deep neural networks to understand the urban scene.
- Data Link: Cityscapes dataset
- Project Idea: Use image segmentation to detect different objects in a video. Image segmentation partitions an image into regions belonging to different categories like cars, buses, people, trees, roads, etc.
15. Kinetics Dataset
The Kinetics dataset is a collection of videos showing simple human-human or human-object interactions. It is made up of three different datasets that together contain URL links to around 650,000 high-quality videos.
- Data Link: Kinetics dataset
- Project Idea: Building a human action recognition model. The model should be able to detect the actions of a human.
16. IMDB-Wiki dataset
The IMDB-Wiki dataset is highly useful for training gender and age classifiers. It is one of the largest open-source datasets of labeled facial images. The images have gender and age labels with them. It is a collection of over 500,000 labeled images.
- Data Link: IMDB wiki dataset
- Project Idea: Make a model that will detect faces and predict their gender and age. Classifying images into age groups would be much more feasible than predicting the exact age.
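Turning exact ages into age groups is a small preprocessing step that converts the problem into ordinary classification. The group boundaries and names below are hypothetical; pick ones that suit your application.

```python
from bisect import bisect_right

# Hypothetical group boundaries (lower bound of each group after the first).
BOUNDS = [13, 20, 40, 60]
GROUPS = ["child", "teen", "young adult", "middle-aged", "senior"]

def age_group(age):
    """Map an exact age to a coarse class label."""
    return GROUPS[bisect_right(BOUNDS, age)]

print([age_group(a) for a in (5, 13, 35, 72)])
```

The model is then trained on these labels, so a prediction that is off by a couple of years usually still lands in the correct group.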
17. Color Detection Dataset
The color detection dataset contains 865 color names with their corresponding RGB (red, green, and blue) values and hexadecimal values.
- Data Link: Color Detection Dataset
- Project Idea: Creating a color detection app. The app can have an interface to pick a color from an image and then display the name of the color.
- Source Code: Color Detection Python Project
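The core of such an app is a lookup that finds the named color closest to the picked pixel. The sketch below uses a handful of standard web colors in place of the full 865-row dataset, and the distance measure (sum of absolute differences in R, G, and B) is an assumed, simple choice rather than the only option.

```python
# A few standard web colors standing in for the full dataset.
COLORS = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}

def closest_color(rgb, colors=COLORS):
    """Return the name whose RGB value is nearest to the picked pixel."""
    def distance(name):
        # Sum of absolute channel differences (Manhattan distance).
        return sum(abs(a - b) for a, b in zip(rgb, colors[name]))
    return min(colors, key=distance)

print(closest_color((250, 10, 20)))
```

In the app, the `rgb` value would come from the pixel under the mouse click, and the returned name would be drawn on the image.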
18. Urban Sound 8K dataset
The Urban Sound 8K dataset is useful for sound classification and recognition. It contains 8732 urban sounds, classified into ten classes such as dog bark, siren, air conditioner, street music, and drilling.
- Data Link: Urban Sound 8K dataset
- Project Idea: Building a sound classification system to detect the type of sound playing in the background. This project will help you understand how to work with unstructured data and get started with audio data.
19. LibriSpeech Dataset
The LibriSpeech dataset contains a large collection of English speech recordings derived from the LibriVox project. It is useful for speech recognition and natural language processing projects. It has around 1000 hours of read English speech in various accents.
- Data Link: LibriSpeech dataset
- Project Idea: Build a speech-to-text converter: a speech recognition model that automatically identifies what is being said in the audio and converts it into text.
20. Breast Histopathology Images Dataset
The Breast Histopathology Images dataset contains 277,524 images. These images are extracted from whole mount slide images of breast cancer specimens. There are 78,786 positive samples and 198,738 negative samples.
- Data Link: Breast histopathology dataset
- Project Idea: Build a model that can classify a tumor image as malignant or benign. You can build an image classification model with a CNN (convolutional neural network).
- Source Code: Breast Cancer Classification Python Project
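Because the two classes are unequal in size, a classifier can look accurate while rarely predicting the positive class. One common countermeasure is to weight each class inversely to its frequency during training. The sketch uses the class counts given in the dataset description; the formula `total / (number_of_classes * class_count)` is a widely used heuristic and an assumption here, not something the dataset prescribes.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Class counts from the dataset description.
counts = {"negative": 198_738, "positive": 78_786}
weights = balanced_class_weights(counts)
print({label: round(w, 3) for label, w in weights.items()})
```

Most deep learning frameworks accept per-class weights of this kind when computing the training loss, so mistakes on the rarer positive class count for more.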
21. YouTube 8M Dataset
The YouTube 8M dataset is a large dataset of labeled videos used for video classification purposes. It has about 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, and 3862 classes. It has an average of three labels per video.
- Data Link: YouTube 8M
- Project Idea: Build a model that can describe what a video is about. The model takes a series of inputs and classifies which category the video belongs to.
In this article, we saw more than 20 machine learning datasets that you can use to practice machine learning or data science algorithms. Creating a dataset of your own is expensive, so using other people’s datasets is more feasible for learning purposes. But we should read each dataset’s documentation carefully before using it.
Rahul Patodi is a part of the AIM Writers Programme. He is a Big Data Architect and works on the latest cutting edge technologies like Big Data, Data Science, ML, DL and AI which are transforming the world.