Intelligent search in digital newspaper archives
Modern magazine and publishing houses often have large databases of publications, which increasingly need to be digitized to be ready for comprehensive research. However, newspapers and magazines that date from before the computer age are often only available on paper. Scanning and digitizing them in large-scale projects takes considerable effort and time. Most of these projects stop at mere digitization and leave the actual conversion of the content into searchable information untouched.
Together with the Hessian PPS PREPRESS SYSTEME GmbH, intergator has been adapted to the requirements of research in newspaper and magazine archives. With the experience from enterprise search projects on the one hand and the know-how from digitization processes on the other, the PPS_Finder was turned into a comprehensive research tool.
The research in detail
As with any research, the central point of access is a plain search form for entering the search term. The results are listed clearly, and every hit shows some metadata and a preview image. Since all documents are automatically processed with OCR technology and then converted to PDF, each paper can be opened instantly in the built-in reader. The search term is highlighted in color, which makes it easier for the user to locate the finding within the document. The search and its results remain intact while a document is open, so no new search is necessary and the research can be continued straight away.
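The highlighting step can be illustrated with a small sketch. This is not how the PPS_Finder reader is implemented; the function name `highlight` and the `[[ ]]` markers are assumptions chosen for illustration, standing in for whatever color markup a real viewer applies.

```python
import re

def highlight(text: str, term: str, open_tag: str = "[[", close_tag: str = "]]") -> str:
    """Wrap every case-insensitive occurrence of `term` in marker strings.

    The markers are hypothetical placeholders for color markup.
    """
    if not term:
        return text
    # re.escape keeps characters such as '.' or '+' in the term literal.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)

print(highlight("The Council met; council members voted.", "council"))
```

The original spelling of each match is preserved, so the OCR text is shown unchanged apart from the markers.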
With the help of freely configurable built-in facets, results can be narrowed down to meaningful findings. Besides volumes, sources (e.g. if a newspaper is issued in various local editions), authors, etc., even individual editorial desks can be filtered for detailed research. A selected facet can be deactivated with a single click, leaving the search itself untouched. Facets can also be used for exclusions: if, for example, you are looking for a term and want to explicitly leave out a certain year, that year can be set as an excluding facet. Additional facets, such as filtering by agency reports or regional editions, are possible in principle.
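The interplay of including facets, excluding facets and deactivation can be sketched as follows. This is a minimal illustration and not the intergator API; the names `Hit`, `FacetSelection` and `apply_facets`, as well as the facet keys, are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    title: str
    facets: dict  # e.g. {"year": "1987", "desk": "Sports"}

@dataclass
class FacetSelection:
    include: dict = field(default_factory=dict)  # facet name -> set of allowed values
    exclude: dict = field(default_factory=dict)  # facet name -> set of rejected values

    def deactivate(self, name: str) -> None:
        """Drop a facet again; the hit list itself is never modified."""
        self.include.pop(name, None)
        self.exclude.pop(name, None)

def apply_facets(hits, selection):
    """Return the hits that match all including and no excluding facets."""
    result = []
    for hit in hits:
        wanted = all(hit.facets.get(n) in vals for n, vals in selection.include.items())
        blocked = any(hit.facets.get(n) in vals for n, vals in selection.exclude.items())
        if wanted and not blocked:
            result.append(hit)
    return result

hits = [
    Hit("Cup final report", {"year": "1987", "desk": "Sports"}),
    Hit("League preview", {"year": "1988", "desk": "Sports"}),
    Hit("Budget debate", {"year": "1988", "desk": "Politics"}),
]
selection = FacetSelection(include={"desk": {"Sports"}}, exclude={"year": {"1987"}})
print([h.title for h in apply_facets(hits, selection)])
```

Because filtering works on a copy of the result list, deactivating a facet only requires applying the remaining selection again; the search itself does not have to be repeated.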
During indexing, each element is given an annotation. Later, while researching, results can be manually enriched with individual tags. This way, related results can be included in future searches. The tags stay with the search result but can be removed at any time. As a result, more complex topics that require a certain context become easier to research.
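Manual tagging of this kind amounts to a small mapping from documents to tags. The sketch below only illustrates the idea; the class `TagStore`, its method names and the document identifiers are hypothetical and do not describe the PPS_Finder's internal storage.

```python
from collections import defaultdict

class TagStore:
    """Keeps user-defined tags per document (hypothetical in-memory store)."""

    def __init__(self):
        self._tags = defaultdict(set)  # document id -> set of tags

    def add(self, doc_id: str, tag: str) -> None:
        self._tags[doc_id].add(tag.lower())

    def remove(self, doc_id: str, tag: str) -> None:
        # discard() does nothing if the tag was never set
        self._tags[doc_id].discard(tag.lower())

    def documents_with(self, tag: str) -> list:
        """All documents carrying the tag, e.g. to pull them into a later search."""
        return sorted(d for d, tags in self._tags.items() if tag.lower() in tags)

store = TagStore()
store.add("issue-1987-03-12-p4", "Dam project")
store.add("issue-1991-07-02-p1", "dam project")
print(store.documents_with("dam project"))
```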
Besides searching through the basic input field, there is also an extended search with additional fields to narrow down the results right from the start. These parameters can also be added to the basic search as special shortcuts, much like at Google. Searching for a certain file type, a date, an author, etc. helps to speed up the process for a trained user. So-called Boolean operators (AND, OR, NOT, etc.) can be used as well.
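How such shortcuts might be separated from the plain search terms is sketched below. The `field:value` syntax and the field names `author`, `filetype` and `date` are assumptions modeled on common web search engines; the actual shortcut syntax of the PPS_Finder is not given here.

```python
# Hypothetical set of shortcut fields; the real product may use other names.
KNOWN_FIELDS = {"author", "filetype", "date"}

def parse_query(query: str):
    """Split a query into free-text terms and field filters.

    Tokens of the form field:value with a known field become filters;
    everything else, including Boolean operators, stays a search term.
    """
    terms, filters = [], {}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key.lower() in KNOWN_FIELDS:
            filters[key.lower()] = value
        else:
            terms.append(token)
    return terms, filters

terms, filters = parse_query("flood AND dam author:Meier filetype:pdf")
print(terms, filters)
```

The Boolean operators are passed through untouched in this sketch; evaluating them would be the job of the search engine itself.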
Machine learning as a new approach
Using machine learning methods, documents can be enriched for a more efficient search. Instead of simply searching for occurrences of a search term within a document, the data is “understood” in its context using advanced cognitive search technologies. This is a fundamentally new approach: conventional search relied heavily on the density and occurrence of terms. Now, the trained machine autonomously scans the data, recognizes the relevant words and relates them to each other. After a short training, the machine develops a model to decide on its own what the document is about and how to categorize it. The results are automatically added to the data, enriching the document with additional content such as metadata.
Even pictures can be meaningfully included in the search with the help of machine learning methods. Until now, images had to be tagged manually to state what the picture shows. A trained machine now scans the content without human intervention, although an initial training is required.
Large amounts of data in particular can be categorized, researched and related to each other. Publishing houses with extensive archives and image stocks that in the past concentrated only on digitization but not on utilization will find a useful research tool in the PPS_Finder. The development of machine learning methods has gained a lot of traction in the last couple of years, and some elements are already built into intergator and the PPS_Finder.
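The train-then-categorize workflow can be illustrated with a deliberately simple text classifier. The sketch uses a multinomial Naive Bayes model as a stand-in; it is not the method used in intergator or the PPS_Finder, and the class name, training sentences and category labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text: str) -> list:
    return text.lower().split()

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocab = set()

    def train(self, text: str, label: str) -> None:
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior plus smoothed log likelihood of every word
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes()
model.train("the team won the match in extra time", "sports")
model.train("the striker scored a late goal", "sports")
model.train("parliament passed the budget vote", "politics")
model.train("the minister announced a new tax law", "politics")

# Enrich a document with the predicted category as additional metadata.
doc = {"text": "minister defends budget in parliament", "meta": {}}
doc["meta"]["category"] = model.classify(doc["text"])
print(doc["meta"])
```

A production system would train on far more data and use richer features, but the principle is the same: the model's decision is stored alongside the document and becomes searchable like any other metadata field.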