Tue, 12.06.2018. Trading zone part 3: Text Re-use Detection. This blog post is the last part of Stepping in the NLP / History trading zone, a series of posts.
Working definition
Text re-use is essentially the meaningful reiteration of text, usually beyond the simple repetition of common language. It is such a broad concept that it can be understood at different levels and studied in a large variety of contexts.
In publishing or teaching, instances of text re-use can constitute plagiarism when portions of someone else’s text are repeated without appropriate attribution. In literary studies, to give another example, text re-use is often a synonym for literary phenomena such as allusions, paraphrases and direct quotations.
Text re-use in historical newspapers
When it comes to historical newspapers, the automatic detection of text re-use can help us track down similar articles within a potentially very large pool of newspaper pages or articles (millions of them). Yet the reasons for this similarity may vary, and closer scrutiny is often required to explain it.
Let’s now look at three examples of text re-use that were automatically detected by running the Gazette de Lausanne (GDL) and the Journal de Genève (JDG) through a software tool called passim.
The first example is the news of an accident involving a steamboat that had departed from Ouchy bound for Geneva. The news appears first in the GDL (3 December 1863) and two days later in the JDG (5 December). The articles are identical except for a few small differences, such as the fact that the JDG reports the name of the boat (Guillaume-Tilt). What is interesting from a more technical point of view is that the software detects the similarity between the two articles despite substantial differences between the two texts caused by the varying quality of the OCR.
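To give an intuition of why two articles can still be recognised as similar when their OCR differs, here is a minimal Python sketch based on overlapping character n-grams. It is an illustration only and does not claim to reproduce what passim does internally; the function names, the n-gram length of 4 and the three sample sentences (invented for the occasion, not taken from the newspapers) are all assumptions.

```python
def char_ngrams(text, n=4):
    """Return the set of character n-grams of a lowercased, whitespace-normalised text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b, n=4):
    """Jaccard similarity of the character n-gram sets of two texts (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented sample sentences: a clean one, a copy with simulated OCR errors,
# and an unrelated one.
clean = "le bateau à vapeur parti d'Ouchy pour Genève"
noisy = "le batcau à vapeur parti d'Ouchv pour Genéve"
unrelated = "le conseil fédéral a publié son rapport annuel"

score_noisy = jaccard(clean, noisy)
score_unrelated = jaccard(clean, unrelated)
print(f"clean vs. noisy copy: {score_noisy:.2f}")
print(f"clean vs. unrelated:  {score_unrelated:.2f}")
```

A single wrong character only damages the few n-grams that contain it, so the noisy copy still shares most of its n-grams with the clean sentence, whereas an unrelated sentence shares very few.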
The second example is more recent (28 April 1980) and it’s a press agency release reprinted almost identically on the same day by both GDL and JDG.
The last example represents the sort of phenomenon captured by automatic text re-use detection that we believe will be most interesting for historians. It consists of excerpts from two articles, published by the two newspapers on the same day (26 May 1900), reporting the declarations of a French general, Gaston de Galliffet. Two things are interesting here: first, in the OCR of both articles the general’s name is garbled (de Gallippet / do Galliffet), meaning that a string-based search would most likely have missed them; second, the GDL article reports the direct speech attributed to de Galliffet, while the JDG paraphrases it. Phenomena of this sort can be interesting to historians, as they may reveal how the same historical event was portrayed and reported by different newspapers, which may have very different political orientations.
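The point about string-based search can be made concrete with a small Python sketch. It compares an exact substring lookup with an approximate one built on `difflib.SequenceMatcher` from the standard library. The helper `fuzzy_find`, the 0.8 threshold and the two short sentences around the garbled names are assumptions made for illustration; only the garbled forms themselves come from the OCR discussed above.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.8):
    """Return (window, score) pairs for word windows of text that resemble query.

    Windows have as many words as the query; the score is SequenceMatcher's
    ratio (between 0.0 and 1.0) computed on lowercased strings.
    """
    words = text.split()
    size = len(query.split())
    hits = []
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        score = SequenceMatcher(None, query.lower(), window.lower()).ratio()
        if score >= threshold:
            hits.append((window, score))
    return hits


query = "de Galliffet"
# Invented sentences wrapped around the two garbled forms of the name.
gdl_text = "le général de Gallippet a déclaré"
jdg_text = "le général do Galliffet a dit"

for label, text in (("GDL", gdl_text), ("JDG", jdg_text)):
    exact = query in text                 # plain string-based search
    approximate = fuzzy_find(query, text)  # tolerant of a few wrong characters
    print(label, "exact match:", exact, "| approximate matches:", approximate)
```

The exact lookup fails on both sentences, while the approximate one recovers both garbled names. The threshold is a trade-off: lowering it tolerates heavier OCR damage but also lets in more false matches.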
Challenges
To conclude, let us consider briefly the technical challenges raised by the automatic detection of text re-use in the vast corpus of impresso newspapers:
Romanello, Matteo. Text re-use detection in a nutshell, Blog post, impresso, 2018 <http://impresso-project.ch/news/2018/06/12/tradingzone-tr.html>.
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
A Creative Commons Attribution-NoDerivatives 4.0 (CC BY-ND 4.0) license applies to all contents published in impresso. While articles published on impresso can be copied by anyone for noncommercial purposes if proper credit is given, all materials are published under an open-access license with authors retaining full and permanent ownership of their work.