2019 Began With Tech Giants Google, Facebook & Stanford Open-sourcing Datasets

The month of January witnessed a number of research works open-sourced by the likes of Google, Facebook and Stanford.

These works set new benchmarks for NLP and image recognition techniques while improving on pre-existing models. Recently, Google introduced a paper on Natural Questions (NQ), a new dataset for QA research, along with methods for evaluating QA systems.

In contrast to tasks where it is relatively easy to gather naturally occurring examples, defining a suitable QA task and developing a methodology for annotation and evaluation is challenging. Given a question and a Wikipedia page, an annotator selects a long answer drawn from the paragraphs of the page and, where possible, a short answer such as an entity or a yes/no.


The question seeks factual information; the Wikipedia page may or may not contain the information required to answer the question; the long answer is a bounding box on this page containing all information required to infer the answer; and the short answer is one or more entities that give a short answer to the question, or a boolean ‘yes’ or ‘no’. Both the long and short answer can be NULL if no viable candidates exist on the Wikipedia page.
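The annotation scheme above can be sketched as a small data structure. This is a hypothetical illustration based on the task definition in the text; the field names are illustrative and are not the dataset's actual schema.

```python
# Hypothetical sketch of a single NQ-style annotation, following the task
# definition: a long answer (a passage on the page containing all information
# required to infer the answer), a short answer (entities or yes/no), and
# None standing in for NULL when no viable candidate exists on the page.
example = {
    "question": "is the pacific ocean bigger than the atlantic",
    "wikipedia_page": "Pacific Ocean",
    "long_answer": "The Pacific Ocean is the largest and deepest of "
                   "Earth's oceanic divisions...",
    "short_answer": "yes",  # could also be entities, or None
}

def has_answer(ex):
    # Both long and short answer can be NULL (None here) when the page
    # does not contain the information required to answer the question.
    return ex["long_answer"] is not None

print(has_answer(example))  # True
```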

Facebook Open Sources LASER

To accelerate the transfer of natural language processing (NLP) applications to many more languages, Facebook enhanced its LASER (Language-Agnostic SEntence Representations) toolkit.

LASER is the first successful exploration of massively multilingual sentence representations to be shared publicly with the NLP community. The toolkit now works with more than 90 languages, written in 28 different alphabets. LASER achieves these results by embedding all languages jointly in a single shared space (rather than having a separate model for each). The multilingual encoder and PyTorch code are freely available, along with a multilingual test set for more than 100 languages.

This work is aimed at applications such as training a classifier for movie reviews as positive or negative in one language and then instantly deploying it in more than 100 other languages.
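The zero-shot transfer idea can be sketched as follows. This is a minimal toy illustration of the shared-embedding-space principle, not LASER's actual API: the `encode` function below is a stand-in for LASER's real multilingual encoder, and the embeddings are fabricated for demonstration.

```python
import numpy as np

# Stand-in for LASER's multilingual encoder: the real encoder maps sentences
# in any of 90+ languages into one shared vector space. Here we fake tiny
# 4-dim embeddings so the sketch is runnable.
def encode(sentence, lang):
    fake_shared_space = {
        ("great movie", "en"):     np.array([0.90, 0.10, 0.00, 0.00]),
        ("terrible movie", "en"):  np.array([0.00, 0.10, 0.90, 0.00]),
        ("film magnifique", "fr"): np.array([0.85, 0.15, 0.05, 0.00]),
        ("film affreux", "fr"):    np.array([0.05, 0.10, 0.88, 0.00]),
    }
    return fake_shared_space[(sentence, lang)]

# "Train" a trivial nearest-centroid sentiment classifier on English only.
pos_centroid = encode("great movie", "en")
neg_centroid = encode("terrible movie", "en")

def classify(sentence, lang):
    v = encode(sentence, lang)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return "positive" if cos(v, pos_centroid) > cos(v, neg_centroid) else "negative"

# Because all languages live in one shared space, the English-trained
# classifier transfers to French with no French training data.
print(classify("film magnifique", "fr"))  # positive
print(classify("film affreux", "fr"))     # negative
```

The design point is that nothing language-specific was learned: the classifier only sees vectors, so any language the encoder covers comes for free.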

Stanford’s CheXpert

Probability prediction from chest radiographs, from the paper by Irvin et al.

CheXpert is a large dataset that contains 224,316 chest radiographs of 65,240 patients. It comes with a labeller that automatically detects the presence of 14 observations in radiology reports, capturing uncertainties inherent in radiograph interpretation.

The researchers at Stanford investigated different approaches to using the uncertainty labels for training convolutional neural networks (CNNs) that output the probability of these observations given the available frontal and lateral radiographs.

On a validation set of 200 chest radiographic studies manually annotated by 3 board-certified radiologists, they found that different uncertainty approaches work best for different pathologies.
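The approaches to uncertainty labels can be sketched as simple label-mapping policies. This is a hedged illustration of the general idea described above; the function names (`u_zeros`, `u_ones`, `u_ignore`) are my shorthand for common policies of this kind, not necessarily the paper's exact terminology.

```python
import numpy as np

# CheXpert-style labels per observation: 1 = positive, 0 = negative,
# -1 = uncertain (as produced by the automatic report labeller).
labels = np.array([1, 0, -1, 1, -1])

def u_zeros(y):
    """Treat uncertain mentions as negative for training."""
    return np.where(y == -1, 0, y)

def u_ones(y):
    """Treat uncertain mentions as positive for training."""
    return np.where(y == -1, 1, y)

def u_ignore(y):
    """Mask uncertain examples out of the training loss entirely."""
    mask = y != -1  # an example contributes to the loss only where True
    return y, mask

print(u_zeros(labels))  # [1 0 0 1 0]
print(u_ones(labels))   # [1 0 1 1 1]
```

Which policy works best can differ per pathology, which is why the researchers compared them on a radiologist-annotated validation set rather than picking one globally.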

The test set is composed of 500 chest radiographic studies annotated by a consensus of 5 board-certified radiologists, and the best model's performance was compared against that of 3 additional radiologists in the detection of 5 selected pathologies. On Cardiomegaly, Edema and Pleural Effusion, the model's ROC and PR curves lie above all 3 radiologist operating points.

This dataset can be used as a standard benchmark to evaluate the performance of chest radiograph interpretation models.


The year 2018 saw a meteoric rise in the number of papers released in the field of AI. There were also numerous tools and techniques open sourced by the giants to carry the baton of AI research. Google's BERT, for instance, introduced new benchmarks for natural language understanding.

Along with LASER, Facebook also released BISON; providing systems with the ability to relate linguistic and visual content is one of the hallmarks of computer vision.

Google and Facebook upped the ante right from the beginning, and the rest of the year sure looks interesting for AI enthusiasts.

Check the other releases here.



Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

