Hugging Face has released Datasets, a community library for contemporary NLP. Launched after a year of development, the Datasets library contains 650 unique datasets and has more than 250 contributors. The library has already supported several novel cross-dataset research projects and shared tasks.
With Datasets, Hugging Face wants to standardise the end-user interface, versioning, and documentation, and to provide a lightweight frontend for internet-scale corpora. The library takes a distributed, community-driven approach to adding datasets and documenting their usage.
Hugging Face’s Datasets
New dataset paradigms have always been crucial to the development of NLP — curated datasets are used for evaluation and benchmarking, supervised datasets are used for fine-tuning models, and large unsupervised datasets are utilised for pretraining and language modelling.
Contemporary NLP systems are developed with pipelines that use different datasets at varying scales and levels of annotation. This means that separate datasets are used for pretraining, benchmarking, and fine-tuning.
The growing number and size of datasets bring several challenges, including interface standardisation, documentation, and versioning. A practitioner should be able to use different datasets without needing different interfaces, and the choice of interface should not change as datasets grow in scale. Versions must also be kept uniform: practitioners using the same dataset must have the same version. Lastly, the procedure used to create a dataset (crowdsourcing, scraping, or synthetic generation) should be taken into account when evaluating the most appropriate dataset for a given task.
To this end, Hugging Face’s Datasets is designed to address the challenges of dataset management and access while supporting community culture and norms. The development of the Datasets project involved a public hackathon in which community contributors developed new dataset builders and added them to the project; this resulted in 485 commits and the addition of 285 unique contributors. A key takeaway from the event was that the range of languages spoken by community members helped in reliably bootstrapping the library. The Datasets library now includes continuous data types, multi-dimensional arrays for images, video data, and an audio type.
With Datasets, Hugging Face aims to achieve the following goals:
- Each dataset in the library uses a standard tabular format and is versioned and cited properly. Any dataset can be downloaded with a single line of code.
- Large datasets can be streamed through the same interface. These datasets are computation and memory efficient and work well with tokenisation and featurisation.
- All the datasets are tagged and documented with their usage, types, and construction.
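As a rough illustration of the first two goals, the sketch below loads the same dataset once as a regular in-memory table and once in streaming mode, then reads rows from both in the same way. It assumes the `datasets` package is installed, that network access is available, and that SQuAD is published on the Hub under the identifier `"squad"`. The helper names `first_rows` and `preview_squad` are hypothetical and are not part of the library.

```python
from itertools import islice


def first_rows(rows, n=2):
    """Return the first n rows of any iterable dataset, in-memory or streamed."""
    return list(islice(rows, n))


def preview_squad(n=2):
    """Load SQuAD in both modes and return the first n rows from each.

    Needs `pip install datasets` and network access. The import is kept
    local so that first_rows() can be used without the library installed.
    """
    from datasets import load_dataset

    # One call downloads, caches, and returns the training split as a table.
    in_memory = load_dataset("squad", split="train")

    # Same call with streaming=True: rows are fetched lazily during iteration
    # instead of being downloaded up front.
    streamed = load_dataset("squad", split="train", streaming=True)

    # Both objects are iterated the same way; each row is a plain dict.
    return first_rows(in_memory, n), first_rows(streamed, n)
```

Calling `preview_squad()` returns two short lists of rows, one from each loading mode. To pin a specific version of a dataset, `load_dataset` also accepts a `revision` argument.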
The library is actively used for a number of tasks. According to the authors, popular English benchmarks like GLUE and SQuAD are the most commonly downloaded datasets. Beyond these, there is a range of popular datasets for different tasks and languages.
Hugging Face: The GitHub for Machine Learning
Hugging Face’s rise as a startup has been phenomenal. In a short time, the company has gained massive attention from the industry. It started as a chatbot company but soon pivoted to become a bigger player in the NLP space. Big companies like Apple, Monzo and Bing use its library in production.
One of its top attractions is the Transformers library, which is backed by PyTorch and TensorFlow. It provides thousands of pretrained models for tasks like text classification, translation, information retrieval, and summarisation. The library has been downloaded over a million times and is extensively used by researchers at Google, Facebook, and Microsoft.
Hugging Face is a strong proponent of open source technology. The company’s founders believe that there is a disconnect between research and engineering teams in NLP. Co-founder Clement Delangue said that the democratisation of AI would extend the benefits of emerging technologies, which are otherwise concentrated in the hands of a few powerful big companies, to smaller organisations. Hugging Face aims to be the GitHub for Machine Learning.