Creating an ML Solution That Accurately Extracts Quotes From News Articles

The Guardian recently announced that it has joined forces with Agence France-Presse (AFP) to build a machine learning solution that accurately extracts quotes from news articles and matches them with the right source. The Guardian says that existing solutions did not perform well on its content: the models struggled to recognise quotes that did not match a classic pattern, and some returned too many false positives, identifying generic statements as quotes.
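To illustrate why pattern-based extraction falls short, here is a minimal sketch of a "classic pattern" matcher. It is not one of the tools the Guardian evaluated; the regular expression, the verb list and the function name `extract_classic_quotes` are assumptions made up for this example.

```python
import re

# Hypothetical "classic pattern": a double-quoted passage, a speech verb,
# then a capitalised name. The verb list is illustrative, not exhaustive.
CLASSIC_QUOTE = re.compile(
    r'"(?P<content>[^"]+)"\s*,?\s*'
    r'(?P<cue>said|says|added|told)\s+'
    r'(?P<source>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)'
)

def extract_classic_quotes(text):
    """Return (content, cue, source) triples for quotes matching the pattern."""
    return [
        (m.group("content").rstrip(","), m.group("cue"), m.group("source"))
        for m in CLASSIC_QUOTE.finditer(text)
    ]

classic = '"We are hiring more reporters," said Jane Smith.'
non_classic = 'Jane Smith was blunt about the plan: "It will not work."'

print(extract_classic_quotes(classic))      # one triple is found
print(extract_classic_quotes(non_classic))  # nothing is found
```

The second sentence contains a genuine quote with a clear speaker, yet the matcher misses it because the source comes before the quote and there is no speech verb. Cases like this are one reason a trained model is attractive.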

Co-reference resolution, the process of establishing the source of a quote by finding the correct reference in the text, was also an issue, especially when the source’s name was mentioned several sentences or even paragraphs before the quote itself.
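The difficulty can be seen with a deliberately naive baseline that replaces a pronoun with the most recently named source. This is only a sketch for illustration, not the approach described by the Guardian; the function `resolve_sources`, the pronoun list and the sample names are all hypothetical.

```python
PRONOUNS = {"he", "she", "they"}

def resolve_sources(quotes):
    """Replace pronoun sources with the most recent non-pronoun source.

    `quotes` is a list of (source, content) pairs in document order.
    A pronoun with no named source before it resolves to None.
    """
    resolved = []
    last_named = None
    for source, content in quotes:
        if source.lower() in PRONOUNS:
            resolved.append((last_named, content))
        else:
            last_named = source
            resolved.append((source, content))
    return resolved

# Invented example: suppose "she" actually refers to Maria Lopez.
quotes = [
    ("Maria Lopez", "The budget is final."),
    ("David Chen", "We disagree."),
    ("she", "There is no room to negotiate."),
]

for source, content in resolve_sources(quotes):
    print(source, "->", content)
```

Here the heuristic attributes the last quote to David Chen, because he is the nearest named speaker, even though the pronoun points further back. Resolving such cases reliably needs more context than proximity alone.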

To train a model to identify quotes in text, the team used two tools created by Explosion: spaCy, one of the leading open-source libraries for advanced natural language processing using deep neural networks, and Prodigy, an annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.

Together with AFP, the team manually annotated more than 800 news articles with three entities: content (the quote, in quotation marks), source (the speaker, which might be a person, an organisation, etc.), and cue (usually a verb phrase, indicating the act of speech or expression).
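One way to picture such an annotation is as a set of labelled character spans over a sentence. The sketch below uses only the Python standard library; the label names, the `LabelledSpan` class, the `find_span` helper and the sample sentence are assumptions for illustration, and the team's actual annotation format is not described in the announcement.

```python
from dataclasses import dataclass

# Label names are assumptions that mirror the three entities described above.
LABELS = ("CONTENT", "SOURCE", "CUE")

@dataclass(frozen=True)
class LabelledSpan:
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    label: str

def find_span(text, fragment, label):
    """Locate the first occurrence of `fragment` in `text` and label it."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    start = text.index(fragment)  # raises ValueError if fragment is absent
    return LabelledSpan(start, start + len(fragment), label)

text = '"The results were encouraging," said the minister.'
annotation = [
    # Including the quotation marks in the span is a choice made here.
    find_span(text, '"The results were encouraging,"', "CONTENT"),
    find_span(text, "said", "CUE"),
    find_span(text, "the minister", "SOURCE"),
]

for span in annotation:
    print(span.label, repr(text[span.start:span.end]))
```

Storing offsets rather than copied strings keeps each label tied to an exact position in the article, which matters when the same word appears more than once.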

The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. The first batch of annotations turned out to be quite noisy and inconsistent, but the team's annotations improved with each iteration.

The model correctly identified all three entities (content, source, cue) in 89% of cases. Considering each entity separately, content scored the highest (93%), followed by cue (86%) and source (84%).
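The announcement does not say how these figures were calculated. As a rough sketch of one possible scoring scheme, the snippet below computes exact-match accuracy per entity and the share of examples where all three entities match. The `score` function and the toy data are invented for illustration and are not taken from the project.

```python
def score(gold, predicted, labels=("content", "source", "cue")):
    """Exact-match accuracy per label, plus the share of examples where
    every label matches. Each example is a dict mapping label -> text."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    total = len(gold)
    per_label = {
        label: sum(g[label] == p.get(label) for g, p in zip(gold, predicted)) / total
        for label in labels
    }
    all_correct = sum(
        all(g[label] == p.get(label) for label in labels)
        for g, p in zip(gold, predicted)
    ) / total
    return per_label, all_correct

# Toy data, invented for illustration.
gold = [
    {"content": "We will appeal", "source": "the lawyer", "cue": "said"},
    {"content": "It is too late", "source": "Ms Jones", "cue": "warned"},
]
predicted = [
    {"content": "We will appeal", "source": "the lawyer", "cue": "said"},
    {"content": "It is too late", "source": "Jones", "cue": "warned"},
]

per_label, all_correct = score(gold, predicted)
print(per_label)
print(all_correct)
```

Under this definition the joint figure can never exceed the lowest per-entity figure, so the published numbers were presumably computed differently, for instance as scores averaged over all annotated spans.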

The Guardian says that it looks forward to building a robust co-reference resolution system and exploring deep learning further. It also plans to address challenges such as identifying meaningful quotes and content.

Victor Dey
Victor is an aspiring data scientist and holds a Master of Science in Data Science & Big Data Analytics. He is a researcher, a data science influencer and a former university football player. A keen learner of new developments in data science and artificial intelligence, he is committed to growing the data science community.
