Top open source datasets from Amazon

Over the years, Amazon and AWS have contributed massively to the open-source community by releasing their comprehensive datasets to the public.

Amazon has open-sourced Multilingual Amazon SLURP for Slot Filling, Intent Classification, and Virtual-assistant Evaluation (MASSIVE), a natural language understanding dataset that covers 51 languages, to encourage developers to build more third-party apps and tools for its voice assistant, Alexa. It contains one million labelled utterances and comes with open-source code for training multilingual AI models. It was compiled by having translators render an English-only dataset into languages spoken across Africa, Latin America, Europe, and Asia. The utterances are mostly questions or common commands, such as asking a device to play a song or checking the weather.
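To make the two tasks in the dataset's name concrete, here is a minimal Python sketch of what an intent label and slot annotations look like on a single command. It assumes a bracketed annotation style of the form "[slot : value]"; the example utterance, the intent and slot names, and the `parse_slots` helper are illustrative and not taken from the dataset files.

```python
import re

# Assumed annotation style: slot spans written as "[slot_name : value]".
SLOT_PATTERN = re.compile(r"\[\s*([^\]:]+?)\s*:\s*([^\]]+?)\s*\]")


def parse_slots(annotated_utterance):
    """Return (plain_text, slots) for a bracket-annotated utterance."""
    slots = [
        (match.group(1), match.group(2))
        for match in SLOT_PATTERN.finditer(annotated_utterance)
    ]
    # Replace each "[slot : value]" span with just its value.
    plain_text = SLOT_PATTERN.sub(lambda match: match.group(2), annotated_utterance)
    return plain_text, slots


# Hypothetical record: one intent label plus slot-annotated text.
record = {
    "intent": "weather_query",
    "annotated": "what is the weather in [place_name : paris] [date : tomorrow]",
}

text, slots = parse_slots(record["annotated"])
print(record["intent"], "|", text, "|", slots)
```

Intent classification predicts the single label for the whole utterance, while slot filling recovers the labelled spans inside it.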

We will look at a few more of Amazon’s open datasets in this article.

Amazon Customer Reviews dataset

Amazon Customer Reviews is a collection of product reviews gathered over more than two decades. It contains over a hundred million reviews in which customers describe their experience with products bought from the website. This makes it a rich source of data for academic research in fields such as NLP, information retrieval, and machine learning. The dataset was created to represent a sample of customer evaluations and opinions, and it also reflects how perception of the same product varies across geographical regions.
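As a rough illustration of the regional comparison mentioned above, the following Python sketch averages the star rating of one product per marketplace. It assumes a tab-separated layout with `marketplace`, `product_id`, and `star_rating` columns; the sample rows, the product ID, and the `mean_rating_by_marketplace` helper are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in an assumed tab-separated layout.
SAMPLE_TSV = (
    "marketplace\tproduct_id\tstar_rating\n"
    "US\tB000EXAMPLE\t5\n"
    "US\tB000EXAMPLE\t4\n"
    "UK\tB000EXAMPLE\t3\n"
    "DE\tB000EXAMPLE\t2\n"
    "DE\tB000EXAMPLE\t4\n"
)


def mean_rating_by_marketplace(tsv_text, product_id):
    """Average star rating of one product, grouped by marketplace."""
    totals = defaultdict(lambda: [0, 0])  # marketplace -> [sum, count]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["product_id"] == product_id:
            bucket = totals[row["marketplace"]]
            bucket[0] += int(row["star_rating"])
            bucket[1] += 1
    return {market: total / count for market, (total, count) in totals.items()}


ratings = mean_rating_by_marketplace(SAMPLE_TSV, "B000EXAMPLE")
print(ratings)
```

With real data, the same grouping would be run over the downloaded review files rather than an inline string.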


Amazon Berkeley Objects dataset

Last year, Amazon and the University of California, Berkeley, jointly released the Amazon Berkeley Objects dataset, a large collection of product images and associated metadata intended to support research on product information management, visual understanding, and information retrieval. Amazon hopes researchers will use it to develop more powerful AI models for image-based shopping and for expanding retailers’ product graphs. The dataset covers close to 150,000 products, each annotated with metadata such as multilingual title, model, brand, product type, and dimensions. It also includes close to 400,000 static catalogue images, 360-degree image sequences for over 8,000 products, captured by rotating in the plane at 5-degree intervals, and over 7,000 3D product models that can be rotated along any axis and rendered in any 3D environment under different lighting conditions.
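The 5-degree interval implies a fixed number of views per full rotation, which the short Python sketch below works out. Only the 5-degree step comes from the description above; the helper names and the convention that view 0 sits at 0 degrees are assumptions made for illustration.

```python
STEP_DEGREES = 5  # rotation interval stated in the dataset description


def views_per_rotation(step_degrees=STEP_DEGREES):
    """Number of views needed to cover a full 360-degree turn."""
    if step_degrees <= 0 or 360 % step_degrees != 0:
        raise ValueError("step must be a positive divisor of 360")
    return 360 // step_degrees


def nearest_view_index(angle_degrees, step_degrees=STEP_DEGREES):
    """Index of the view closest to an angle (assumes view 0 is at 0 degrees)."""
    count = views_per_rotation(step_degrees)
    # Wrap the angle into [0, 360), snap to the nearest step, wrap the index.
    return round((angle_degrees % 360) / step_degrees) % count


print(views_per_rotation())
print(nearest_view_index(93))
```

A viewer that lets shoppers drag a product around could use a lookup like `nearest_view_index` to pick which image of a sequence to show.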

SpaceNet

Launched in 2016, SpaceNet is an open innovation project that offers a repository of freely available imagery with co-registered map features. SpaceNet hosts datasets developed by its own team along with datasets from projects like IARPA’s Functional Map of the World (fMoW). Before SpaceNet, researchers had far fewer options for obtaining free, precision-labelled, high-resolution satellite imagery.

Cancer Genome Atlas

The Cancer Genome Atlas is the result of a collaboration between the National Cancer Institute and the National Human Genome Research Institute. The project set out to generate comprehensive, multi-dimensional maps of the key genomic changes in major types of cancer. By analysing matched tumour and normal tissue samples from 11,000 patients, the group was able to produce a comprehensive characterisation of 33 cancer types and subtypes, including ten rare cancers. The dataset contains Clinical Supplement, miRNA-Seq Isoform Expression Quantification, Genotyping Array Masked Copy Number Segment, Genotyping Array Gene Level Copy Number Scores, and WXS Masked Somatic Mutation data from the Genomic Data Commons (GDC), as well as Whole Exome Sequencing (WXS), RNA-Seq, miRNA-Seq, and WXS Aggregated Somatic Mutation data.

Genome Aggregation Database

The Genome Aggregation Database (gnomAD) is developed jointly by an international coalition of investigators who aggregate both exome and genome data from a range of large-scale human sequencing projects. The v2 dataset, aligned to GRCh37, spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals. The v3 dataset, aligned to GRCh38, contains 71,702 genomes selected in the same way as in v2.

Foldingathome COVID-19 Datasets

Folding@home is a major distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new treatments. During the COVID-19 pandemic, Folding@home partnered with several experimental collaborators to speed up progress toward effective therapies for COVID-19. One outcome of these efforts was the creation of the world’s first exascale distributed computing resource, which was used to generate scientific datasets of massive size.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long-term impact on individuals and societies.
