Meta AI launches an open-sourced model to make Wikipedia entries more accurate

This is a powerful example of machine learning tools that can help scale the work of volunteers by efficiently recommending citations and accurate sources.

Meta AI has developed the first model capable of automatically verifying hundreds of thousands of citations. Trained on 134 million public web pages, the open-source model can check whether citations support the corresponding claims.

It highlights questionable citations, allowing human editors to assess the cases that are most likely to be flawed without having to sift through thousands of properly cited statements. If a citation appears irrelevant, the model will recommend a more relevant source, even pointing to a specific passage that supports the claim.

“This is a powerful example of machine learning tools that can help scale the work of volunteers by efficiently recommending citations and accurate sources. Improving these processes will allow us to attract new editors to Wikipedia and provide better, more reliable information to billions of people around the world. I look forward to continued improvements in this area, especially as machine learning tools are able to provide more customized citations and multilingual options to serve our Wikimedia communities across more than 300 languages,” said Shani Evenstein Sigalov, a lecturer and researcher at Tel Aviv University, and Vice Chair of the Wikimedia Foundation’s Board of Trustees.

Learning all of Wikipedia 

In September 2020, Meta released an AI model that integrates information retrieval and verification. Since then, the company has been working on training neural networks to learn more nuanced representations of language so that they can find relevant source material in a pool of data the size of the internet.

Using natural language understanding (NLU) techniques, the system estimates the likelihood that a claim can be inferred from a source. To determine whether one statement supports or contradicts another, the models create and compare mathematical representations of the meanings of entire statements during a search.
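The entailment step described above can be sketched with a toy stand-in. The real system uses learned neural representations of whole statements; here a bag-of-words count vector and cosine similarity play that role purely for illustration (all function names below are hypothetical, not Meta's API):

```python
import math
from collections import Counter

def embed(text):
    # Toy "meaning representation": a bag-of-words count vector.
    # The real system uses dense embeddings from a trained neural model.
    return Counter(text.lower().split())

def support_score(claim, source):
    # Cosine similarity between the two representations, standing in for
    # the learned probability that the source entails the claim.
    a, b = embed(claim), embed(source)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

claim = "the eiffel tower is in paris"
good = "the eiffel tower is a landmark located in paris france"
bad = "bananas are rich in potassium"
assert support_score(claim, good) > support_score(claim, bad)
```

The point of the sketch is only the comparison: both statements are mapped into the same vector space, and the score between them drives the verdict.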

The new dataset of 134 million web pages serves as one of the system's main components: Sphere, an open-source, web-scale retrieval library. Meta fed the algorithms 4 million Wikipedia claims, teaching them to pinpoint a single source from a vast pool of web pages to validate each statement. Because web pages can contain long stretches of text, the models evaluate content in chunks and consider only the most relevant passage when deciding whether to recommend a URL. These prebuilt indices, which catalogue 40 times more content than other Wikipedia indices, will be included with Sphere.
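The chunk-and-score idea can be illustrated with a minimal sketch. A crude word-overlap score substitutes for the learned relevance model, and the window sizes are arbitrary assumptions; only the structure (split the page, score each passage, keep the best) mirrors the description above:

```python
def overlap_score(claim, passage):
    # Crude proxy for the learned relevance model: fraction of the
    # claim's words that appear in the passage.
    c, p = set(claim.lower().split()), set(passage.lower().split())
    return len(c & p) / len(c) if c else 0.0

def chunk_passages(text, size=40, stride=20):
    # Break a long page into overlapping word windows so each piece
    # fits within the model's input length.
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - stride, stride)]

def best_passage(claim, page_text):
    # Score every chunk and keep only the highest-scoring one; that
    # passage decides whether the page's URL is worth recommending.
    return max(chunk_passages(page_text), key=lambda p: overlap_score(claim, p))
```

The design choice this mirrors is that a page is never judged by its full text at once: one strong passage is enough to make the URL a candidate citation.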

The indices route potential sources through an evidence-ranking model that compares the new text to the original citation. Using fine-grained language comprehension, the model ranks the cited source and the retrieved alternatives by how likely each is to support the claim. In practice, the model recommends the most relevant URLs as prospective citations for a human editor to review and approve.
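The ranking step can be sketched as follows. The scoring function is again a hypothetical word-overlap stand-in for the fine-grained verification model; what matters is that the existing citation and the retrieved alternatives compete in one pool, and a human editor reviews whatever comes out on top:

```python
def support_score(claim, passage):
    # Stand-in for the fine-grained verification model: fraction of
    # claim words the passage contains.
    c, p = set(claim.lower().split()), set(passage.lower().split())
    return len(c & p) / len(c) if c else 0.0

def rank_evidence(claim, cited_passage, retrieved_passages):
    # Pool the existing citation with the retrieved alternatives, then
    # sort by estimated support; the top hit goes to an editor for review.
    pool = [("current citation", cited_passage)]
    pool += [("alternative %d" % i, p) for i, p in enumerate(retrieved_passages, 1)]
    return sorted(pool, key=lambda item: support_score(claim, item[1]), reverse=True)

claim = "mount everest is the tallest mountain on earth"
cited = "paris is the capital of france"
alts = ["mount everest is the tallest mountain above sea level on earth",
        "the nile is a long river"]
ranked = rank_evidence(claim, cited, alts)
# A stronger alternative outranks the weak existing citation.
assert ranked[0][0] == "alternative 1"
```

If the current citation wins the ranking, nothing is flagged; if an alternative wins, the citation is surfaced as questionable with a suggested replacement.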

Making sense of the real world

Meta’s ultimate goal is to create a platform that will assist Wikipedia editors in systematically identifying citation issues and quickly fixing the citation or correcting the content of the corresponding article at scale.

This model could also guide the way to better results on many other tasks, such as classic natural language inference, retrieval in question-answering systems, and few-shot learning.


Sri Krishna
Sri Krishna is a technology enthusiast with a professional background in journalism. He believes in writing on subjects that evoke a thought process towards a better world. When not writing, he indulges his passion for automobiles and poetry.
