Meta AI has developed the first model capable of automatically verifying hundreds of thousands of citations. Trained on 134 million public web pages, the open-source model can check whether citations actually support the claims they are attached to.
It highlights questionable citations, allowing human editors to assess the cases that are most likely to be flawed without having to sift through thousands of properly cited statements. If a citation appears irrelevant, the model will recommend a more relevant source, even pointing to a specific passage that supports the claim.
“This is a powerful example of machine learning tools that can help scale the work of volunteers by efficiently recommending citations and accurate sources. Improving these processes will allow us to attract new editors to Wikipedia and provide better, more reliable information to billions of people around the world. I look forward to continued improvements in this area, especially as machine learning tools are able to provide more customized citations and multilingual options to serve our Wikimedia communities across more than 300 languages,” said Shani Evenstein Sigalov, a lecturer and researcher at Tel Aviv University, and Vice Chair of the Wikimedia Foundation’s Board of Trustees.
Learning all of Wikipedia
In September 2020, Meta released an AI model that integrates information retrieval and verification. Since then, the company has been working on training neural networks to learn more nuanced representations of language so that they can find relevant source material in a pool of data the size of the internet.
Using natural language understanding (NLU) techniques, the system estimates the likelihood that a claim can be inferred from a source. To determine whether one statement supports or contradicts another, the models create and compare mathematical representations of the meanings of entire statements during a search.
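To make this concrete, here is a minimal sketch of comparing "mathematical representations" of two statements. Meta's actual system uses dense neural sentence embeddings; the bag-of-words vectors and cosine similarity below are a deliberately simple stand-in so the idea stays runnable, and the function names are illustrative, not Meta's API.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy "meaning" vector: a bag-of-words count.
    # The real models learn dense neural embeddings instead.
    return Counter(text.lower().split())

def support_score(claim: str, source: str) -> float:
    # Cosine similarity between the two vectors, standing in for the
    # learned likelihood that the source entails the claim.
    a, b = embed(claim), embed(source)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

claim = "the eiffel tower is in paris"
relevant = "the eiffel tower stands in paris france"
irrelevant = "bananas are rich in potassium"
assert support_score(claim, relevant) > support_score(claim, irrelevant)
```

Even this crude scorer ranks the supporting sentence above the unrelated one; the neural version does the same comparison, but over representations that capture meaning rather than word overlap.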
The new dataset of 134 million web pages underpins one of the system’s main components: Sphere, an open-source, web-scale retrieval library. Meta fed the algorithms 4 million Wikipedia claims, teaching them to pinpoint a single source from a vast pool of web pages to validate each statement. Because web pages can contain long stretches of text, the models evaluate content in chunks and consider only the most relevant passage when deciding whether to recommend a URL. These prebuilt indices, which catalogue 40 times more content than other Wikipedia indices, will ship with Sphere.
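The chunk-then-pick-the-best step can be sketched in a few lines. This is a toy version: the fixed 50-word window and the word-overlap relevance score are illustrative assumptions, not Meta's actual chunk size or scoring model.

```python
def overlap_score(claim: str, passage: str) -> float:
    # Toy relevance score: fraction of the claim's words found in the passage.
    claim_words = set(claim.lower().split())
    passage_words = set(passage.lower().split())
    return len(claim_words & passage_words) / len(claim_words) if claim_words else 0.0

def best_passage(page_text: str, claim: str, chunk_words: int = 50) -> str:
    # Split a long page into fixed-size word windows and keep only the
    # window that best matches the claim, mirroring how the models score
    # pages chunk by chunk. chunk_words = 50 is an illustrative choice.
    words = page_text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    return max(chunks, key=lambda chunk: overlap_score(claim, chunk))
```

Scoring only the best passage, rather than the whole page, is what lets the system point an editor to the specific span that supports a claim instead of just a URL.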
The indices route potential sources through an evidence-ranking model that compares the new text to the original citation. Using fine-grained language comprehension, the model ranks the cited source and the retrieved alternatives by how likely they are to support the claim. In practice, the model recommends the most relevant URLs as prospective citations for a human editor to review and approve.
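The ranking-and-flagging step might look like the sketch below. The support scores are assumed to come from an upstream verification model (like the scorer described earlier); the function and field names are hypothetical, chosen only to illustrate the workflow.

```python
def review_citation(cited: tuple[str, float],
                    retrieved: list[tuple[str, float]]) -> dict:
    # cited: (url, support_score) for the existing citation.
    # retrieved: (url, support_score) pairs for alternatives from the index.
    # Rank everything together; if an alternative outranks the existing
    # citation, flag it and suggest the better URL for human review.
    ranked = sorted([cited] + retrieved, key=lambda pair: pair[1], reverse=True)
    best_url, _ = ranked[0]
    return {
        "flagged": best_url != cited[0],   # existing citation is questionable
        "suggested": best_url,             # top candidate for the editor
        "ranking": ranked,
    }

result = review_citation(("weak-source.example", 0.3),
                         [("strong-source.example", 0.9),
                          ("off-topic.example", 0.1)])
assert result["flagged"] and result["suggested"] == "strong-source.example"
```

Note that the model only recommends: the final accept-or-reject decision stays with a human editor, which is why surfacing a ranked list rather than a single verdict fits the Wikipedia workflow.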
Making sense of the real world
Meta’s ultimate goal is to create a platform that will assist Wikipedia editors in systematically identifying citation issues and quickly fixing the citation or correcting the content of the corresponding article at scale.
This model could also guide the way to better results on many other tasks, such as classic natural language inference, retrieval in question-answering systems, and few-shot learning.