Meta Platforms Inc. has developed an artificial intelligence system that can scan a Wikipedia article, analyze the sources cited by the article and identify if some of them may need to be changed.
Meta detailed the AI system today. The company also released the code for the system under an open-source license.
Wikipedia editors ensure that a given piece of information in a Wikipedia article is accurate by checking the source from which the information was retrieved. Checking all the sources cited in an article can be a time-consuming process. Wikipedia has millions of pages, some of which contain hundreds of citations.
Meta’s newly released AI system aims to ease the work of Wikipedia editors by partly automating the task of reviewing citations. The system can scan an article and identify if there are pieces of information that are backed up by a questionable citation. Moreover, it’s capable of recommending more relevant sources with which the questionable citation can be replaced.
For example, a Wikipedia article about a certain Apple Inc. product might accidentally reference a page on Apple’s website that discusses an entirely different product. Meta’s newly detailed AI system could determine that such a citation is incorrect. Moreover, it could recommend the correct page on Apple’s website that the Wikipedia article should reference.
Meta taught the AI system to detect erroneous citations by training it on 4 million snippets of text from Wikipedia. Additionally, Meta created a dataset called Sphere that contains 134 million documents sourced from the open web. When it finds a questionable citation in a Wikipedia article, the AI system searches the documents in the Sphere dataset to find a more relevant source.
The process through which the system finds a new source with which a questionable citation can be replaced involves multiple steps.
Because the Sphere dataset contains 134 million documents, searching it for potential citations can take a significant amount of time. Meta’s researchers have sped up the task by developing a collection of specialized indices. In a data management context, indices are collections of shortcuts that make it possible to find specific pieces of information faster.
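To illustrate the general idea of an index as a collection of shortcuts, here is a minimal sketch of an inverted index in Python. It is not Meta's implementation, and the article does not describe how the Sphere indices are built. The function names and sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the documents matching the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, standing in for a much larger corpus.
documents = {
    "doc1": "the iphone was announced in 2007",
    "doc2": "the ipad was announced in 2010",
    "doc3": "apple reported quarterly earnings",
}
index = build_inverted_index(documents)
print(lookup(index, "announced iphone"))
```

Once the index is built, a query only touches the entries for its own words instead of rescanning every document, which is the kind of shortcut that makes a large collection practical to search.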
Meta’s AI system uses these indices to search the Sphere dataset for citations more quickly than would otherwise be possible. When the system finds a document that could potentially be cited as a source, it extracts the most relevant passage from the document. It’s also capable of determining if there are multiple documents that could potentially be cited as a source.
According to Meta, the system determines whether a document from Sphere backs up a piece of information found in a Wikipedia article by creating mathematical representations of both text snippets. Those mathematical representations are then compared to assess how relevant the document is to the information in the article.
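The comparison of mathematical representations can be sketched in a few lines of Python. The example below is an assumption-laden stand-in: it represents each snippet as a simple vector of word counts and compares vectors with cosine similarity, whereas Meta's system presumably relies on representations learned by a neural network. The `embed` and `cosine_similarity` names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Represent a snippet as word counts (a stand-in for a learned vector)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical claim and candidate passages.
claim = embed("the iphone was announced in 2007")
related = embed("apple announced the iphone in 2007")
unrelated = embed("quarterly earnings rose last year")

print(cosine_similarity(claim, related) > cosine_similarity(claim, unrelated))
```

A score near 1 means the two snippets point in nearly the same direction, while a score of 0 means they share nothing. Word counts cannot tell whether a passage supports or contradicts a claim, which is why a production system would need richer representations.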
“We’ve designed our tools to compare these representations in order to determine whether one statement supports or contradicts another,” Meta researchers detailed in a blog post today. If the AI system finds multiple documents that could be cited as a source, it ranks them based on their likelihood of being relevant.
“Using fine-grained language comprehension, the model ranks the cited source and the retrieved alternatives according to the likelihood that they support the claim,” the researchers elaborated. “When deployed in the real world, the model will offer the most relevant URLs as prospective citations for a human editor to review and approve.”
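The ranking step the researchers describe can be illustrated with a short Python sketch. It scores the currently cited source alongside retrieved alternatives and sorts them from most to least likely to support the claim. The scoring function here is a simple word-overlap measure used purely as a placeholder for Meta's language model, and the URLs, passages and function names are all hypothetical.

```python
def overlap_score(claim, passage):
    """Placeholder scorer: share of words the claim and passage have in common."""
    claim_words = set(claim.lower().split())
    passage_words = set(passage.lower().split())
    if not claim_words or not passage_words:
        return 0.0
    return len(claim_words & passage_words) / len(claim_words | passage_words)


def rank_sources(claim, candidates, score_fn=overlap_score):
    """Return (url, score) pairs sorted from highest to lowest score."""
    scored = [(url, score_fn(claim, passage)) for url, passage in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical claim, cited source and retrieved alternatives.
claim = "the iphone was announced in 2007"
candidates = {
    "https://example.com/cited": "the ipad was announced in 2010",
    "https://example.com/iphone": "apple announced the iphone in 2007",
    "https://example.com/earnings": "apple reported quarterly earnings",
}

ranked = rank_sources(claim, candidates)
for url, score in ranked:
    print(f"{score:.2f}  {url}")
```

In this toy example the cited page lands below one of the retrieved alternatives, which is the situation in which the system would suggest a replacement for a human editor to review.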
In addition to the AI system itself, Meta has open-sourced the Sphere dataset and the indices it developed to make the dataset easier to search. Moreover, the company is releasing the code for an internal tool called distributed-faiss. The tool makes it possible to run indices across multiple servers rather than on a single machine, which streamlines processing.
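The idea of spreading an index across several machines can be sketched without any networking code. The Python example below does not use the distributed-faiss API, and it is only an assumption about the general pattern: documents are split into shards, each shard answers the query on its own, and the partial results are merged. The shards here are plain dictionaries in one process, and all names and sample data are hypothetical.

```python
import heapq


def shard_documents(documents, num_shards):
    """Split documents round-robin into num_shards dictionaries."""
    shards = [dict() for _ in range(num_shards)]
    for position, (doc_id, text) in enumerate(sorted(documents.items())):
        shards[position % num_shards][doc_id] = text
    return shards


def search_shard(shard, query_words, k):
    """Return the top-k (score, doc_id) pairs from a single shard."""
    scored = (
        (len(query_words & set(text.lower().split())), doc_id)
        for doc_id, text in shard.items()
    )
    return heapq.nlargest(k, scored)


def search_all(shards, query, k=2):
    """Query every shard, then merge the partial results into a global top-k."""
    query_words = set(query.lower().split())
    partial = []
    for shard in shards:  # on a real cluster, each shard would be a server
        partial.extend(search_shard(shard, query_words, k))
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


# Hypothetical sample documents.
documents = {
    "doc1": "the iphone was announced in 2007",
    "doc2": "the ipad was announced in 2010",
    "doc3": "apple reported quarterly earnings",
    "doc4": "the iphone has a touchscreen",
}
shards = shard_documents(documents, num_shards=2)
print(search_all(shards, "iphone announced 2007", k=1))
```

Because each shard returns its own top results, the merged answer matches what a single large index would return, while each machine only has to hold and scan a fraction of the data.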
Meta believes that its AI system, the Sphere dataset and the other components developed by its engineers as part of the project could support multiple use cases in the future. “These models are the first components of potential editors that could help verify documents in real time,” the company detailed. “In addition to proposing citations, the system would suggest auto-complete text — informed by relevant documents found on the web — and offer proofreading corrections.”