Historical newspaper collections are among the richest sources of information for humanities research. In recent years, many of these collections have been digitized and automatically transcribed via OCR to allow digital access to the historical material. However, extracting specific articles from a collection, finding related images, and linking related articles is often time-consuming and labor-intensive. Automatic article extraction and semantic enrichment would greatly improve the accessibility of these historical newspaper collections. This is exactly what this project investigates within the DATA-KBR-BE project: AI-based methods to facilitate article-level search in historical collections.
The annotation pipeline aims to enhance searchability through semantic data enrichment and cross-collection linkage of information. The NewspAIper demo, which is based on this pipeline, allows the user to query the collection interactively. Starting from the existing OCR results of the collection, the pipeline performs article segmentation, named entity recognition, and semantic linking.
Using the NewspAIper platform, humanities researchers can more easily extract information relevant to their research interests from the collection. The platform offers interactive filters based on article date, language, and the entities found in the text, and it allows users to browse similar articles and illustrations. The pipeline will be improved and used to construct corpora for subsequent research. As future work, this project aims to improve the semantic text enrichment by adding both topic detection and trend analysis (detecting subsequent articles on the same topic). Furthermore, toponyms found in the articles could be used to geolocate the historical images. These improvements will allow for even more fine-grained filtering.
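To make the filtering idea concrete, the following is a minimal sketch of how articles enriched by the pipeline might be filtered on date, language, and found entities. It is not the NewspAIper implementation: the `Article` record, its field names, and the `filter_articles` function are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    """Hypothetical record for one segmented, enriched article."""
    title: str
    published: date
    language: str
    entities: set[str] = field(default_factory=set)


def filter_articles(articles, start=None, end=None, language=None, entities=None):
    """Return the articles that satisfy every filter that was supplied.

    Filters left as None are ignored. When `entities` is given, an article
    must contain all of the listed entities to be kept.
    """
    wanted = set(entities or [])
    result = []
    for article in articles:
        if start is not None and article.published < start:
            continue
        if end is not None and article.published > end:
            continue
        if language is not None and article.language != language:
            continue
        if not wanted <= article.entities:
            continue
        result.append(article)
    return result
```

A call such as `filter_articles(articles, language="fr", entities=["Bruxelles"])` would keep only French-language articles in which the entity "Bruxelles" was recognized. The filters combine with a logical AND, so each additional criterion narrows the result set.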
#LTMThursday interview
Steven Verstock (IDLab Ghent University) shares all the information on the aims of this project, what the demonstrator is already able to offer, and where its development is headed.