Open sourcing projects is one of the best ways to drive innovation from across the community and take it out of the hands of big tech. But over the past six years, big tech's contribution to the open source GitHub community has increased fourfold, with Google overtaking Microsoft, and IBM and Amazon joining the race. Meta has launched a host of new open source projects, and recent innovator Stability AI has joined the club as well.
Large language models like GPT-3 and text-to-image models like DALL-E left developers waiting for open source alternatives they could get their hands on.
Check out this list of top datasets and projects that were open sourced in 2022, ready for further contributions and development.
BLOOM
In June, an early, open source version of the BLOOM language model was released by BigScience. One of the few truly multilingual LLMs, it was trained by the largest collaboration of AI researchers to date and has 176 billion parameters, one billion more than OpenAI's GPT-3. It generates text in 46 natural languages and 13 programming languages.
Click here to check it out.
Stable Diffusion
When text-to-image generators were on the rise with DALL-E and others, developers wanted to try them out on their own. In August, Stability AI announced the public release of Stable Diffusion under the Creative ML OpenRAIL-M licence.
Click here to check it out.
Meta OPT-66B
In June, Meta announced the release of Open Pre-trained Transformer (OPT-66B), which was then one of the largest open source models to date. It also released the logbooks for training all of its baselines, ranging from 125M to 66B parameters. This came after Meta released its OPT-175B language model along with smaller open source alternatives.
Click here to check out the repository.
Google Attention Center
Google’s Attention Center is a TensorFlow Lite model that predicts the attention centre of an image, the point where its most important and eye-catching parts lie. You can use a Python script to batch encode images using the predicted attention centres.
Click here to check out the repository on GitHub.
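The model itself learns where people look, but the underlying idea of an "attention centre" can be illustrated with a rough conceptual sketch (this is not the repository's code): given a saliency map where higher values mark more visually important pixels, the centre is the saliency-weighted centroid.

```python
# Conceptual sketch only -- NOT the Attention Center model or repo code.
# Given a 2D saliency map (higher value = more visually important), the
# "attention centre" can be approximated as the saliency-weighted centroid.

def attention_centre(saliency):
    """Return the (row, col) centroid of a 2D grid of non-negative weights."""
    total = rsum = csum = 0.0
    for i, row in enumerate(saliency):
        for j, v in enumerate(row):
            total += v
            rsum += i * v
            csum += j * v
    if total == 0:
        raise ValueError("saliency map has no mass")
    return rsum / total, csum / total

# All the saliency sits in the middle cell, so the centre is (1.0, 1.0).
centre = attention_centre([[0, 0, 0],
                           [0, 4, 0],
                           [0, 0, 0]])
```

The real model predicts this point directly from pixels rather than from a hand-built saliency map, which is what makes it fast enough to run under TensorFlow Lite.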
CORD-19
CORD-19, or the COVID-19 Open Research Dataset, is a corpus of academic research papers about COVID-19. On June 2, the final version of the corpus was released, after weekly updates since March 2020. The maintainers of the GitHub repository have cleaned the data to further NLP research efforts.
Click here for the GitHub repository.
Bias in Advertising Data
In June, IBM released a synthetic dataset of user records useful for demonstrating the discovery, measurement, and mitigation of bias in advertising. The dataset includes individual user records with feature attributes such as gender, age, income, parental status, home ownership, and more.
Check out the release of the dataset here.
Microsoft FarmVibes.AI
Microsoft open sourced ‘Project FarmVibes’, a suite of farm-focused technologies built as an AI-powered toolkit for guiding farming decisions. Its FarmVibes.AI algorithms run on Microsoft’s Azure to predict the ideal amounts of fertiliser and herbicide, and the suite also includes a multi-modal geospatial ML inference engine.
Click here to check out their blog and here for the GitHub repository.
NASA and ASDI Climate Dataset
Amazon Sustainability Data Initiative (ASDI) partnered with NASA to accelerate research and innovation in sustainability by providing an open dataset for anyone to use. In addition, the partners are providing grants to those interested in using the data to solve long-term sustainability problems.
Click here to know more.
Google Vizier
Introduced in 2017, Google Vizier is an internal service for performing black-box optimization that became the de facto parameter tuning engine at Google. In July, the company decided to open source it as a standalone Python implementation. Google developed OSS Vizier as a service that lets users suggest and evaluate trials while collecting metrics and data over time.
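The trial loop described above can be illustrated with a minimal conceptual sketch (this is NOT the OSS Vizier API, just plain random search over an opaque objective to show the black-box pattern: suggest parameters, evaluate, record the metric, keep the best trial).

```python
import random

# Conceptual black-box optimisation loop in the Vizier style -- NOT the
# OSS Vizier API. The objective is opaque: no gradients, just evaluations.

def optimise(objective, bounds, num_trials=200, seed=0):
    """Random-search a dict of {name: (low, high)} bounds.
    Returns (best_params, best_value) after num_trials evaluations."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(num_trials):
        # "Suggest" a trial by sampling each parameter within its bounds.
        trial = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        value = objective(trial)  # evaluate the black box
        if value < best_value:   # record the metric, keep the best trial
            best_params, best_value = trial, value
    return best_params, best_value

# Minimise (x - 3)^2 over x in [-10, 10]; the optimum is near x = 3.
params, value = optimise(lambda p: (p["x"] - 3) ** 2, {"x": (-10, 10)})
```

OSS Vizier wraps the same suggest/evaluate/report cycle in a client-server design with far smarter search algorithms than random sampling.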
Switch Transformer Model on T5X
In June, Google Brain open sourced its Switch Transformer models, including the 1.6-trillion-parameter Switch-C and the 395-billion-parameter Switch-XXL, in T5X, a modular, research-friendly framework for high-performance, highly configurable models at many scales.
Click here to check out the repository.
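What lets a Switch Transformer reach trillions of parameters is its routing rule: each token is sent to exactly one expert, the one its router scores highest, so per-token compute stays flat while total parameters grow with the expert count. A toy sketch of that top-1 routing (not the T5X implementation):

```python
# Conceptual sketch of Switch-style top-1 routing -- NOT the T5X code.
# Each token is routed to exactly ONE expert (the highest router score),
# so compute per token is constant while parameters scale with experts.

def switch_route(router_scores):
    """router_scores: one list of per-expert scores for each token.
    Returns the index of the chosen expert for each token."""
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in router_scores]

scores = [[0.1, 0.7, 0.2],    # token 0: expert 1 scores highest
          [0.9, 0.05, 0.05]]  # token 1: expert 0 scores highest
assignments = switch_route(scores)
```

In the real model the scores come from a learned router layer and an auxiliary loss keeps the load balanced across experts.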
BERT Language Model
US-based Neural Magic collaborated with Intel Corporation to develop a ‘pruned’ version of BERT-Large that achieves higher performance in less storage space, and open sourced it on Hugging Face in July.
Click here to learn more.
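Pruning in general means zeroing out the least influential weights so the model shrinks with little accuracy loss. A toy sketch of the common magnitude-pruning heuristic (hypothetical illustration, not Neural Magic's or Intel's actual method):

```python
# Conceptual sketch of magnitude pruning -- a hypothetical illustration,
# NOT the Neural Magic/Intel pipeline. Weights with the smallest absolute
# value are zeroed, shrinking effective model size while keeping the
# largest, most influential weights.

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

# At 50% sparsity the two smallest-magnitude weights are zeroed.
pruned = prune_by_magnitude([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
```

Sparse weight matrices like this compress well and can be skipped over at inference time, which is where the storage and speed gains come from.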
Open Images V7
Open Images V7 is the latest update of the dataset for computer vision tasks: almost 61.4 million images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships.
Click here to check it out.