AI for modular journalism and human-in-the-loop workflows. "Along with colleagues from the Agence France-Presse (AFP), the team used Prodigy to manually annotate more than 800 news articles to identify three parts of quotes: source: the speaker which might be a person or an organization, cue: usually a verb phrase indicating the act of speech or expression, and content: the quote in quotation marks."

"The final step would include coreference resolution to define ambiguous references (e.g., pronouns like 'he' or 'she'). With such information, this model could structure data on quotes (e.g., what was the quote and who said it) to enable reuse of the quotes in different media formats."

This piece is a bit jargony, but the gist is this: regular-expression rules for extracting quotes are inadequate, so they use a machine learning technique called named entity recognition, with a tool called Prodigy that lets them train their own models. Named entity recognition trains a neural network to recognize proper nouns regardless of how many words they span, and it doesn't get confused by the things that fool simpler algorithms, such as when some of the words in a named entity are also ordinary words that could appear on their own in regular text.
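Here's a minimal sketch of named entity recognition using spaCy (the library Prodigy is built around); the example sentence, the invented names in it, and the off-the-shelf English model are my own choices. The real pipeline would use custom labels like source, cue, and content trained from their Prodigy annotations, not the stock ones shown here.

```python
import spacy

# Off-the-shelf NER already handles multi-word proper nouns, which is the
# property this approach relies on. Prodigy would be used to annotate and
# train custom span labels (source, cue, content) on top of a pipeline like this.
nlp = spacy.load("en_core_web_sm")

text = ('"We will keep reporting on this," said Jane Smith, '
        'a spokesperson for Acme News Group.')
doc = nlp(text)

for ent in doc.ents:
    # Expected, roughly: "Jane Smith" -> PERSON, "Acme News Group" -> ORG
    print(ent.text, "->", ent.label_)
```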

Once the quotes are extracted, additional neural networks are trained to generate quotes that conform to various style guides. So the quotes you see in news articles could have been automatically generated by an AI system rather than by the journalist whose name is on the piece, though the journalist is supposed to check that they're correct, and those corrections are supposed to feed back into the neural network as "human-in-the-loop" training data.
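I don't know what their actual correction workflow looks like, but the "human-in-the-loop" idea is roughly: show the model's output to an editor, and turn any correction into a new training example. A hypothetical sketch, with every function and field name made up:

```python
# Hypothetical human-in-the-loop sketch: the model's extraction is shown to an
# editor, and any correction is saved as a new training example for the next
# fine-tuning run. Nothing here is the real Guardian/AFP code.

def extract_quote(article_text: str) -> str:
    """Stand-in for the trained extraction model."""
    return article_text.split('"')[1] if '"' in article_text else ""

corrections = []  # accumulates corrected examples for retraining

def review(article_text: str) -> str:
    prediction = extract_quote(article_text)
    print("Model suggests:", prediction)
    edited = input("Press Enter to accept, or type a correction: ").strip()
    final = edited or prediction
    if edited:
        # Each human correction becomes labelled data for the next training round.
        corrections.append({"text": article_text, "quote": final})
    return final

if __name__ == "__main__":
    review('The minister said "we have no further comment" on Tuesday.')
```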

I quote a lot of stuff. Maybe I should use this?

How the Guardian approaches quote extraction with NLP

#solidstatelife #ai #nlp #quotes
