Moving beyond digital filters. How to integrate the digitised press into the historian’s workflow

Julien Nguyễn Đăng, Fri, 06.07.2018

impresso Laurel workshop, École polytechnique fédérale de Lausanne (EPFL)

The second impresso workshop (Laurel) took place on 5 and 6 July on the EPFL campus in Lausanne. It was preceded by a seminar dedicated to the practice of history in the digital age (L’histoire contemporaine à l’ère numérique: sources, méthodologies, critiques, #JEhistNUM) held on the neighbouring UNIL campus and organised by Frédéric Clavert (C2DH), Solenn Huitric (UNIL), Enrico Natale and Raphaëlle Ruppen-Coutaz (UNIL). A report on the seminar will be made available. During this UNIL seminar, historians and archivists discussed the changes that have occurred with the advent of the digital era and presented their research and document collection practices in the specific fields of diplomatic documents, born-digital web archives and, of course, digitised newspapers. This third type of historical source was represented by Guillaume Pinson from the Numapresse project and Estelle Bunout from impresso. The dialogue between the two projects fuelled discussions over the following days, reflecting their common underlying challenges and complementary approaches.

En route to the impresso workshop!

The impresso workshop was attended by impresso members and associated partners, who came together to discuss project-related topics, continue the work on historical research scenarios and share their views on the first interface prototype and mock-ups. As well as interface co-design, attention was given to interdisciplinary exchanges on questions including press historiography and teaching with newspaper interfaces, with the participation of infoclio, the Numapresse team and the UNIL contemporary history department. The workshop was divided into three sessions, which are detailed in the following sections.

Research on digitised collections of newspapers preceded by 100 years of press historiography

The first session, chaired by Enrico Natale, looked back at the emergence, sedimentation and transformation of the historiography of the press, starting in the 19th century and leading to the momentous “civilisation of the newspapers” project coordinated by Dominique Kalifa, Philippe Régnier, Marie-Ève Thérenty and Alain Vaillant in 2011, before raising the question of the transposition of this rich academic tradition to corpora of digitised newspapers.

Marie-Ève Thérenty, a professor of French literature at Université Paul-Valéry Montpellier III and a specialist in media history who supervises the Numapresse project, helped place the impresso project in a broader context by presenting the historiography of the French media. This study of the press in France mainly dealt with two aspects: legal matters and freedom of the press on the one hand, and the legitimacy and morality of press influence on the other. After numerous scientific publications in the 1920s and 1930s about newspapers as autonomous and legitimate political actors, a new era began after 1945 with a scientific approach to newspaper history. Monographs on specific periodicals were published, historians began to question newspaper content and broad summaries were written. In the 1980s, new aspects other than politics started to be considered, in particular literature and poetic language but also networks, circulation and sociability. Considering the present situation, Marie-Ève Thérenty identified three challenges that are currently underpinning the multiple research projects in the field. First, the primary source has to be analysed, which involves reflecting on the mass digitisation of the press and its advantages for researchers. Second, international cooperation is needed to carry out such vast projects. Finally, online research platforms gather the results and make them available to the whole scholarly community.

Marie-Ève Thérenty on French press historiography

In this context, an eight-year project known as La civilisation du journal, involving 60 researchers, was carried out to produce a cultural, literary and social history of the 19th-century French press. This gave a real boost to research on the topic. In parallel, Médias 19, an ongoing French-Canadian project, has followed this trend by highlighting the media culture of the 19th century. Among other things, this project contains a database of biographical notes. The Numapresse project was set up to write a long-term history of the press, up to the era of the online press, using digitised corpora from libraries and universities and text mining techniques.

Pierre-Carl Langlais, holder of a PhD in Communication and Information Sciences and a researcher at Université Paul-Valéry Montpellier III, then presented the Numapresse tools. Innovative text mining techniques were applied to a corpus of 1.5 TB of digitised newspapers from the French National Library (Bibliothèque nationale de France), via the Huma-Num research infrastructure. In order to classify newspaper content by genre or form, a selection of corpora underwent supervised modelling based on a manual classification; the taxonomy will be developed on a continuous basis. Lexical sketches of each genre were formed before an algorithm automated the recognition. This offered a retrospective view of the history of each genre from multiple perspectives (e.g. time, position within the newspaper), and the various corpora were generated on demand, including corpora of images. Metadata also proved useful in providing a sociological and economic history of journalists. Another important step was taken in the area of viral texts: by analysing the reproduction of articles in other periodicals or even books, networks can be identified and examined, although this remains quite difficult to study across languages. All articles in Numapresse undergo an enrichment process in which metadata is added, resulting in the growth of an existing bibliographical database of places, figures and events.
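To give a rough idea of how the reproduction of articles across periodicals can be spotted automatically, here is a minimal sketch in Python. It is not the Numapresse method, which this report does not describe in detail. It assumes a common and simple technique: each article is cut into overlapping word n-grams ("shingles"), and two articles are flagged as a possible reprint when their shingle sets overlap strongly (Jaccard similarity). The function names, the article identifiers and the threshold of 0.5 are all illustrative assumptions.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_pairs(articles, threshold=0.5, n=3):
    """Return (id_1, id_2, score) for article pairs that look like reprints.

    `articles` maps a (hypothetical) article id to its plain text.
    Every pair is compared, which is fine for a demo but far too slow
    for a corpus the size of a national library's collection.
    """
    ids = sorted(articles)
    sets = {i: shingles(articles[i], n) for i in ids}
    pairs = []
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            score = jaccard(sets[ids[x]], sets[ids[y]])
            if score >= threshold:
                pairs.append((ids[x], ids[y], score))
    return pairs


# Toy example: "a" and "b" share almost all of their wording, "c" does not.
articles = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog today",
    "c": "completely unrelated text about something else entirely",
}
print(reuse_pairs(articles))
```

The flagged pairs can then be read as edges of a network linking periodicals. Note that exact n-gram matching breaks down on noisy OCR and does not work across languages, which is consistent with the multilingual difficulty mentioned above.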

Pierre-Carl Langlais on Numapresse tools and methods

After this insightful presentation of developments in press historiography by researchers involved in the Numapresse project and a reminder of impresso’s main objectives by Maud Ehrmann, the floor was opened for discussion between participants.

How to bring digitised newspapers to students of both humanities and computer science

The second session, chaired by Frédéric Clavert (C2DH), Solenn Huitric (UNIL) and Raphaëlle Ruppen-Coutaz (UNIL), focused on the use of digitised newspaper interfaces in the classroom. All participants were asked to take part in a working group to prepare a history class based on the archives of La Gazette de Lausanne and Le Journal de Genève. While Group A chose violence in Switzerland as a general topic for personal student research projects and adopted a teaching approach involving the progressive incorporation of digital humanities techniques, Group B decided to emphasise the teaching of computer science tools, and Group C – who chose “women in newspapers” as a main subject – focused on both history and computer science in their class.

Group C presenting their work at the end of the UNIL workshop session

This range of approaches reveals both the complexity and the potential of including digital humanities in university history classes. The session chairs spoke about their experience of teaching digital humanities to EPFL, UNIL and University of Luxembourg students from various backgrounds, sometimes with the involvement of DH researchers. They observed a surprising lack of computer skills among students, meaning that computer science techniques needed to be incorporated into the teaching. It is also important to draw students’ attention to the effect of tools and the filters these create between the students and the historical sources, so that they retain the critical approach that is vital for research. The lecturers also identified differences between students: while interpreting results posed a challenge to non-history students, building and processing corpora remained cumbersome and time-consuming for history students. In short, the teaching was appreciated and, using these new tools, the students subjected the digitised sources to source criticism, a vital part of historians’ training and a relevant novelty for non-history students. The end of the session provided an opportunity to draft a broad list of the concepts, skills and tools needed by teachers and students for this kind of class.

Christopher Morse presenting his exciting virtual reality (VR) newspaper browser

At last, a chance to try out the interface!

The final session reviewed the early achievements of impresso, especially in the areas of data processing and the initial interface prototype and mock-ups. As Maud Ehrmann explained, impresso is in the process of acquiring 221 Swiss and Luxembourgish newspapers. Once the data is on servers, many pre-processing steps will occur (converting text and images to a canonical format, ingestion into a database), before indexing and various text mining techniques can actually happen. Designing and setting up a system architecture that is able to coordinate many processes developed by three geographically distant teams is challenging but also very exciting. As well as core functionalities, particular emphasis will be placed on security, scalability and transparency.

Phillip Ströbel presented the challenge of OCR quality assessment by looking at the case of the Neue Zürcher Zeitung. Using the Transkribus platform, the UZH team manually transcribed about 150 pages and trained a handwritten text recognition (HTR) model on 112 pages. The model’s performance is impressive, especially on low-quality images and Fraktur fonts.
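For readers unfamiliar with how recognition quality is usually quantified, the sketch below computes a character error rate (CER): the edit distance between a manual transcription and the machine output, divided by the length of the manual transcription. The report does not say which metric the UZH team used, so treating CER as the measure is an assumption, and the function names and sample strings are purely illustrative.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character of ref
                            curr[j - 1] + 1,      # insert a character of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Toy example: one wrong character in a seven-character word.
print(character_error_rate("Zeitung", "Zeitnng"))
```

A lower value means better recognition. In practice the manually transcribed pages that were not used for training would serve as the reference when evaluating the model.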

Busy impresso team

impresso involves not only computer scientists and designers but also historians, who were asked to present historical research scenarios that they will work on using the impresso interface. Estelle Bunout presented her project on charting the anti-European posture in the public debate in Switzerland and Luxembourg (1848-1945), whereas Marten Düring used as an example the coverage of the Battle of Arnhem in European newspapers since 1944. Julien Nguyễn Đăng, an intern at the C2DH, chose to study the representation of extraterrestrials in newspapers, while Solenn Huitric reflected upon the funding of secondary education in 19th-century Europe and the development of a public debate on the topic. Enrico Natale presented newspapers as a relevant source for piecing together a history of computing in Switzerland. Tobias von Waldkirch, from the University of Basel, introduced his PhD research on 19th-century journalism cultures based on a comparison between the Neue Zürcher Zeitung and the Journal de Genève. Finally, Gerold Schneider, a computational linguist at the University of Zurich, shared his work on the history of thought in medicine from scholasticism to rationalism and the use of topic modelling to exploit large collections of texts.

Enrico Natale on newspapers and the history of computing in Switzerland

Daniele Guido, the lead designer of the impresso project, then presented the first impresso interface prototype, based on UZH and DHLAB inputs and co-developed with Thijs van Beek and Paul Schroeder. He started by showing the search tool and its filter and visualisation options. He then presented the viewer, which supports focused reading by displaying a whole newspaper page enriched with associated metadata, and which allows content to be exported, in particular as citations. Users can also save articles into collections.

Daniele Guido presenting impresso’s interface

All participants were given the opportunity to test the prototype, which meant that direct feedback could be collected on visualisation, multilingual characteristics and accessibility issues. These observations, which reflect the interdisciplinary nature of impresso, will help computer scientists and designers build the most useful interface possible.

Time to discover impresso’s interface!

This first exchange on the prototype of the impresso interface, embedded in the discussion on the historiography of the press and the use of the digitised press in a classroom setting, proved very fruitful. The associated partners of impresso, the impresso team and the workshop guests were able to hear about each other’s experiences during the different sessions and to begin to understand and take into account these varying perspectives.

  • Workshop agenda
  • Presentation by Marie-Ève Thérenty on French press historiography
  • Presentation by Pierre-Carl Langlais on Numapresse tools and methods
  • Presentation by the impresso team on data acquisition, processing and interfaces

Moving beyond digital filters. How to integrate the digitised press into the historian’s workflow, Blog post, impresso, 2018.