Last updated October 7, 2021

Now Hugging Face Gives Away 650 NLP Datasets For Free

With Datasets, Hugging Face wants to standardise end-user interface, versioning, and documentation, and provide a lightweight frontend for internet-scale corpora.

Published on September 16, 2021
by Shraddha Goled

Hugging Face has released Datasets, a community library for contemporary NLP. Launched after a year of development, the library contains 650 unique datasets and has more than 250 contributors. It has supported several novel cross-dataset research projects and shared tasks.

With Datasets, Hugging Face wants to standardise end-user interface, versioning, and documentation and provide a lightweight frontend for internet-scale corpora. It has a distributed, community-driven approach to adding datasets and documenting usage.

Hugging Face’s Datasets

New dataset paradigms have always been crucial to the development of NLP — curated datasets are used for evaluation and benchmarking, supervised datasets are used for fine-tuning models, and large unsupervised datasets are utilised for pretraining and language modelling.

Contemporary NLP systems are developed with pipelines that use different datasets at varying scales and levels of annotation. This means that different datasets are used for pretraining, benchmarking, and fine-tuning.

Growing datasets bring several challenges, including interface standardisation, documentation, and versioning. A practitioner should be able to use different datasets without needing different interfaces, and the choice of interface should not change as datasets grow in scale. Versions must also be kept uniform: practitioners using the same dataset must have the same version. Lastly, the procedure used to create a dataset (crowdsourcing, scraping, or synthetic generation) should be taken into account when evaluating the most appropriate dataset for a given task.

To this end, Hugging Face’s Datasets is designed to address the challenges of dataset management and access while supporting community culture and norms. Development of the Datasets project involved a public hackathon in which community contributors developed new dataset builders and added them to the project; this resulted in 485 commits and the addition of 285 unique contributors. A key takeaway from the event was that the range of languages spoken by community members helped bootstrap the library reliably. The Datasets library now includes continuous data types, multi-dimensional arrays for images, video data, and an audio type.

With Datasets, Hugging Face aims to achieve the following goals:

Each dataset in the library uses a standard tabular format and is properly versioned and cited. Any dataset can be downloaded with a single line of code.
Large datasets can be streamed through the same interface. These datasets are compute- and memory-efficient and work well with tokenisation and featurisation.
All the datasets are tagged and documented with their usage, types, and construction.
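
The goals above can be illustrated with a minimal sketch. The article itself shows no code, so everything here is an assumption: the `load_dataset` function, its `streaming=True` option, and the `map` method come from the `datasets` Python package, `"squad"` is assumed to be the identifier for SQuAD, and the helper names `add_length`, `first_rows`, and `demo` are hypothetical. The featurisation step is a toy token count, not a real tokeniser.

```python
from itertools import islice


def add_length(row):
    """Toy featurisation step: count whitespace-separated tokens in a question."""
    return {"n_tokens": len(row["question"].split())}


def first_rows(rows, n=2):
    """Take the first n rows from any iterable of row dicts."""
    return list(islice(rows, n))


def demo():
    # Imported here so the helpers above stay usable without the library installed.
    from datasets import load_dataset

    # One line downloads and caches the training split as a table of rows.
    squad = load_dataset("squad", split="train")
    # map() merges the returned dict into each row as a new column.
    squad = squad.map(add_length)
    print(first_rows(squad))

    # The same call with streaming=True iterates lazily instead of
    # downloading the whole split first.
    streamed = load_dataset("squad", split="train", streaming=True)
    print(first_rows(streamed.map(add_length)))
```

Calling `demo()` requires the `datasets` package and network access. The point of the sketch is that the downloaded and streamed variants are consumed through the same row-by-row interface.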

Datasets is actively used for a number of tasks. According to the authors, popular English benchmarks like GLUE and SQuAD are the most commonly downloaded datasets. Beyond that, there is a range of popular models for different tasks and languages.

Hugging Face: The GitHub for Machine Learning

Hugging Face’s rise as a startup has been phenomenal. In a short time, the company has gained massive attention from the industry. It started as a chatbot company but soon pivoted to become a bigger player in the NLP space. Big companies like Apple, Monzo, and Bing use its library in production.

One of its top attractions is the Transformers library, which is backed by PyTorch and TensorFlow. It provides thousands of pretrained models for tasks like text classification, translation, information retrieval, and summarisation. The library has been downloaded over a million times and is extensively used by researchers at Google, Facebook, and Microsoft.

Hugging Face is a strong proponent of open source technology. The company’s founders believe that there is a disconnect between research and engineering teams in NLP. Co-founder Clement Delangue said that the democratisation of AI would extend the benefits of emerging technologies, which are otherwise concentrated in the hands of a few powerful big companies, to smaller organisations. Hugging Face aims to be the GitHub for Machine Learning.

Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
