Leveraging an unprecedented corpus of newspaper and radio archives, Impresso - Media Monitoring of the Past is an interdisciplinary research project that uses machine learning to pursue a paradigm shift in the processing, semantic enrichment, representation, exploration and study of historical media across modalities, time, languages, and national borders.
Impresso seeks to transcend language and media barriers to enable, for the first time, the joint exploration of a vast corpus comprising both newspaper and radio archives. This initiative spans different modalities, time periods, languages, and national boundaries. Our objective extends beyond simply aggregating collections and enabling full-text search capabilities. We aim to enhance and interconnect these sources by implementing multiple layers of advanced semantic enrichments, all represented within a unified, multilingual vector space. Additionally, our goal is to develop robust, meaningful, and transparent exploration tools specifically tailored for historical research.
Our work represents not merely an expansion in volume, but a fundamental shift in the approach to processing, representing, and studying sources from a transmedia and transnational perspective. This project is collaboratively designing and developing an open, versatile technological framework to enable seamless exploration of semantically interconnected media archives. Our focus areas include:
- Advancing multilingual natural language processing techniques to convert diverse, unstructured, and noisy historical media sources into semantically enriched data, ultimately interconnected within a shared vector space.
- Progressing the field of digital (media) history, both in terms of research and methodologies.
- Creating and implementing innovative interfaces for exploring, visualizing, and comparing extensive amounts of enriched historical print and audio collections.
Project schema
Historical media reflect the times during which they were created and offer rich insights into past norms, values, knowledge horizons and the historicity of media themselves. Since the 1990s, newspaper and radio archives have undergone massive digitization, and traditional barriers hindering the study of historical media, namely difficult access and tedious exploration, have started to fall.
Millions of facsimiles and digital broadcast records, along with their machine-readable content, are now available for research. Existing tools for exploring digitized newspapers and radio broadcasts nevertheless form a fragmented landscape, in which automatic processing and computational approaches are typically restricted to one language and one media type. These limitations severely hamper historical research, which is driven by the discovery of relations between its objects of study through iterative processes of comparing, contrasting and associating sources and information.
Impresso establishes the legal and technical prerequisites to build a corpus of historical newspaper and radio collections from our partners. The newspaper collections come from Switzerland, Austria, Belgium, France, Germany, Luxembourg, the Netherlands and the UK and span the 19th and 20th centuries; the radio broadcasts come from Switzerland, Austria, France, Germany, the Netherlands and the UK and date from the 1930s onward. For newspapers, the corpus includes page scans, text in METS/ALTO and other XML formats, and accompanying bibliographic metadata. For radio, it includes, for example, speaker typescripts, transcribed speech, audio content, radio programmes and accompanying metadata.
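To give a sense of what working with newspaper text in ALTO looks like, the following is a minimal sketch that reads the words of each text line from an ALTO document. It relies only on the standard ALTO elements `TextLine` and `String` and the `CONTENT` attribute; the function names and the tiny sample document are invented for illustration and are not part of the Impresso code base. Element names are matched without their namespace because different ALTO versions use different namespace URIs.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml: str) -> list:
    """Return one string per ALTO TextLine, joining its String/@CONTENT values."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only String children carry text; SP (space) and HYP (hyphen) are skipped here.
        words = [
            child.attrib.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return lines


# Invented sample document, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="First"/><SP/><String CONTENT="example"/><SP/><String CONTENT="line"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Second"/><SP/><String CONTENT="line"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE_ALTO):
        print(line)
```

A real ingestion pipeline would also need to handle hyphenation, reading order, and the METS file that ties pages together; this sketch deliberately leaves those out.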
We seek to enhance and interconnect historical media collections using both mono- and multilingual natural language processing (NLP) and image processing components. These components help us consolidate, process, enrich, index, link, and connect historical sources.
Semantic enrichment
Our goal is to enrich the consolidated source collection with structured semantic information. This includes fine-grained Named Entity Recognition and Classification (NERC), reliable entity linking, and the extraction of people’s titles and professions. We also aim to determine who said what through the identification of interviews and quote extraction. Additionally, we plan to semantically qualify content units and detect themes using keyphrase extraction, text classification, and multilingual topic modeling. Image documents will be classified and undergo object detection. These enrichments support faceted search and historical research.
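As an illustration of how such enrichments can support faceted search, here is a minimal sketch in which each content item carries the entities and topics attached to it, and facets are simple counts over a filtered result set. The `ContentItem` structure, its field names, and the identifiers in the sample data are assumptions made for this example; they do not describe the actual Impresso data model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Hypothetical enriched content unit (article or broadcast segment)."""
    item_id: str
    language: str
    entities: set = field(default_factory=set)  # linked named entities
    topics: set = field(default_factory=set)    # topic or theme labels


def filter_items(items, entity=None, topic=None, language=None):
    """Keep the items that match every facet value that was provided."""
    selected = []
    for item in items:
        if entity is not None and entity not in item.entities:
            continue
        if topic is not None and topic not in item.topics:
            continue
        if language is not None and item.language != language:
            continue
        selected.append(item)
    return selected


def facet_counts(items, facet):
    """Count how often each value of the given facet occurs in the items."""
    counts = Counter()
    for item in items:
        value = getattr(item, facet)
        # Set-valued facets contribute one count per member.
        values = value if isinstance(value, (set, frozenset)) else [value]
        counts.update(values)
    return counts


# Invented sample data, for illustration only.
ITEMS = [
    ContentItem("a1", "fr", {"person:X"}, {"topic:politics"}),
    ContentItem("a2", "de", {"person:X", "place:Y"}, {"topic:politics"}),
    ContentItem("r1", "de", {"place:Y"}, {"topic:sport"}),
]

if __name__ == "__main__":
    hits = filter_items(ITEMS, entity="person:X")
    print([item.item_id for item in hits])
    print(facet_counts(hits, "language"))
```

Filtering on a linked entity rather than on a surface string is what lets one query return items in several languages at once, which is the point of combining entity linking with faceting.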
Semantic indexing and linking using multilingual dense vector spaces
To enable meaningful content comparison across languages, we will create a shared multilingual dense vector space for semantic indexing. This space will include representations for words, keyphrases, named entities, paragraphs, documents, and images. We will align embeddings between languages, enabling the retrieval of semantically similar content across languages. Hierarchical clustering will group related items, forming the basis for comparative timeline-based views. We will also interlink our content with externa