Top 12 Datasets and Projects Open Sourced in 2022

Open sourcing these projects and datasets has driven innovation and development across the developer community.

Open sourcing projects is one of the best ways to drive innovation across the community rather than leaving it solely in the hands of big tech. Yet over the past six years, big tech’s contribution to the open source community on GitHub has increased fourfold, with Google overtaking Microsoft, and IBM and Amazon joining the race. Meta has released many new open source projects, and a recent innovator, Stability AI, has joined the club as well.

Large language models like GPT-3 and text-to-image models like DALL-E left developers waiting for open source alternatives that they could get their hands on.

Check out this list of top datasets and projects that were open sourced in 2022 for further contributions and development.



BLOOM

In June, BigScience released an early, open source version of the BLOOM language model. It is a multilingual LLM trained by the largest collaboration of AI researchers and has 176 billion parameters, one billion more than OpenAI’s GPT-3. It generates text in 46 natural languages and 13 programming languages.

Click here to check it out.

Stable Diffusion

When text-to-image generators were on the rise with DALL-E and others, developers wanted to try them out on their own. In August, Stability AI announced the public release of Stable Diffusion under the CreativeML Open RAIL-M licence.

Click here to check it out.

Meta OPT-66B

In June, Meta announced the release of Open Pre-trained Transformer (OPT-66B), which was at the time one of the largest open-source models. It also released the logbooks for training all its baselines, from 125M through 66B parameters. This came after Meta released its OPT-175B language model along with smaller open-source alternatives.

Click here to check out the repository.

Google Attention Center

Google’s Attention Center is a TensorFlow Lite model that predicts the attention centre of an image, the point around which its most important and attractive parts lie. You can use a Python script to batch encode images using the predicted attention centres.

Click here to check out the repository on GitHub.
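To make the idea of an attention centre concrete, here is a minimal sketch of one hypothetical downstream use: cropping an image around a predicted point. The function name `crop_around_center` and the choice of cropping are assumptions made for illustration; they are not taken from Google's repository, and the sketch takes the point as given rather than running the model.

```python
from typing import Tuple


def crop_around_center(
    width: int,
    height: int,
    center_x: int,
    center_y: int,
    crop_w: int,
    crop_h: int,
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop_w x crop_h box.

    The box is centred on the attention point where possible and shifted
    back inside the image when the point lies near an edge.
    (Hypothetical helper, not part of the Attention Center repository.)
    """
    if crop_w > width or crop_h > height:
        raise ValueError("crop must fit inside the image")
    # Clamp the top-left corner so the box never leaves the image.
    left = min(max(center_x - crop_w // 2, 0), width - crop_w)
    top = min(max(center_y - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


# Attention point in the middle of a 1000x800 image.
centred_box = crop_around_center(1000, 800, 500, 400, 400, 400)
# Attention point near the top-right corner: the box is pushed back inside.
edge_box = crop_around_center(1000, 800, 900, 100, 400, 400)
```

The returned tuple uses the (left, top, right, bottom) convention that most imaging libraries accept for cropping.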


CORD-19

CORD-19, or COVID-19 Open Research Dataset, is a corpus of academic research papers about COVID-19. On June 2, the final version of the corpus was released after being updated weekly since March 2020. The host of the GitHub repository has cleaned the data to further NLP research efforts.

Click here for the GitHub repository.


Bias in Advertising Data

In June, IBM released a synthetic dataset of user records useful for demonstrating the discovery, measurement, and mitigation of bias in advertising. The dataset includes individual data on specific users and feature attributes such as gender, age, income, parental status, home ownership, and more.

Check out the release of the dataset here.
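As a rough illustration of what measuring bias on such user records can look like, the sketch below computes per-group outcome rates and a disparate impact ratio. The records, the field names `gender` and `converted`, and both helper functions are hypothetical; they do not reflect the schema or tooling of IBM's dataset, and disparate impact is only one of several possible metrics.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Hypothetical records; the field names are illustrative only.
records = [
    {"gender": "F", "converted": True},
    {"gender": "F", "converted": False},
    {"gender": "F", "converted": False},
    {"gender": "F", "converted": False},
    {"gender": "M", "converted": True},
    {"gender": "M", "converted": True},
    {"gender": "M", "converted": False},
    {"gender": "M", "converted": False},
]


def selection_rates(
    rows: Iterable[Mapping[str, object]], group_key: str, outcome_key: str
) -> Dict[object, float]:
    """Fraction of positive outcomes within each group."""
    totals: Dict[object, int] = defaultdict(int)
    positives: Dict[object, int] = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        positives[group] += int(bool(row[outcome_key]))
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact(
    rates: Mapping[object, float], unprivileged: object, privileged: object
) -> float:
    """Ratio of the unprivileged group's rate to the privileged group's rate."""
    if rates[privileged] == 0:
        raise ValueError("privileged group has a zero positive rate")
    return rates[unprivileged] / rates[privileged]


rates = selection_rates(records, "gender", "converted")
ratio = disparate_impact(rates, unprivileged="F", privileged="M")
```

A ratio close to 1 suggests the two groups receive the positive outcome at similar rates, while a ratio well below 1 flags a gap worth investigating.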

Microsoft FarmVibes.AI

Microsoft open-sourced ‘Project FarmVibes’, a suite of farm-focused technologies that includes FarmVibes.AI, an AI-powered toolkit for guiding decisions in farming. FarmVibes.AI algorithms run on Microsoft’s Azure and predict the ideal amounts of fertiliser and herbicide. The multi-modal geospatial ML toolkit also has an inference engine.

Click here to check out their blog and here for GitHub repository.

NASA and ASDI Climate Dataset

The Amazon Sustainability Data Initiative (ASDI) partnered with NASA to accelerate research and innovation in sustainability by making an open dataset available to anyone. In addition, the partners are providing grants to those interested in exploring the technology to solve long-term sustainability problems using the provided data.

Click here to know more.

Google Vizier

Introduced in 2017, Google Vizier is an internal service for performing black-box optimization that became the de facto parameter tuning engine for Google. In July, the company decided to open source it as a standalone Python implementation. Google developed OSS Vizier as a service that lets users evaluate Trials while also collecting metrics and data over time.
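The loop of suggesting parameters, evaluating a Trial, and recording its metric can be sketched in a few lines. This is not the OSS Vizier API: the `Trial` class, `random_search`, and `toy_objective` below are hypothetical names, and plain random search stands in for whatever algorithm a real tuning service would use.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trial:
    """One evaluated parameter setting and the metric it produced."""
    parameters: Dict[str, float]
    metric: float


def random_search(
    objective: Callable[[Dict[str, float]], float],
    search_space: Dict[str, Tuple[float, float]],
    num_trials: int,
    seed: int = 0,
) -> List[Trial]:
    """Treat `objective` as a black box: suggest, evaluate, record."""
    rng = random.Random(seed)
    trials: List[Trial] = []
    for _ in range(num_trials):
        # Suggest: sample each parameter uniformly from its range.
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        # Evaluate and record the metric for later comparison.
        trials.append(Trial(parameters=params, metric=objective(params)))
    return trials


def toy_objective(params: Dict[str, float]) -> float:
    """Stand-in for an expensive training run; the maximum is 0."""
    return -((params["x"] - 0.3) ** 2) - (params["y"] + 0.1) ** 2


space = {"x": (-1.0, 1.0), "y": (-1.0, 1.0)}
trials = random_search(toy_objective, space, num_trials=50)
best = max(trials, key=lambda trial: trial.metric)
```

Keeping every Trial, rather than only the best one, is what allows metrics to be collected and compared over time.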

Switch Transformer Model on T5X

In June, Google Brain open sourced its Switch Transformer models in T5X, including the 1.6-trillion-parameter Switch-C and the 395-billion-parameter Switch-XXL. T5X is a modular, research-friendly framework for high-performance, highly configurable training and inference of models at many scales.

Click here to check out the repository. 

BERT Language Model

US-based Neural Magic collaborated with Intel Corporation to develop a ‘pruned’ version of BERT-Large that achieves higher performance while taking up less storage space, and open sourced it on Hugging Face in July.

Click here to learn more.
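For readers unfamiliar with pruning, the sketch below shows unstructured magnitude pruning, one common approach, on a tiny weight matrix. It is an assumption chosen for illustration and not necessarily the method Neural Magic and Intel used; the helper names `magnitude_prune` and `sparsity_of` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to set to zero (0.0 to 1.0).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    # Rank every entry by absolute value, remembering its position.
    ranked = sorted(
        (abs(value), row_idx, col_idx)
        for row_idx, row in enumerate(weights)
        for col_idx, value in enumerate(row)
    )
    num_to_zero = int(len(ranked) * sparsity)
    pruned = [list(row) for row in weights]
    for _, row_idx, col_idx in ranked[:num_to_zero]:
        pruned[row_idx][col_idx] = 0.0
    return pruned


def sparsity_of(weights: Matrix) -> float:
    """Fraction of entries that are exactly zero."""
    values = [value for row in weights for value in row]
    return sum(1 for value in values if value == 0.0) / len(values)


weights = [[0.9, -0.05, 0.4], [-0.01, 0.7, -0.2]]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Zeroed weights can be stored in a sparse format and skipped at inference time, which is where the savings in storage and compute come from.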

Open Images V7

Open Images V7 is the latest update of a dataset of about 9 million images useful for computer vision tasks. The images are annotated with image-level labels (roughly 61.4 million of them), object segmentation masks, object bounding boxes, and visual relationships.

Click here to check it out.

Mohit Pandey
Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
