Reusing the Model and Components of an IIR Study for Perceived Effects of OCR Quality Change

Kimmo Kettunen, Heikki Keskustalo, Birger Larsen, Tuula Pääkkönen, Juha Rautiainen

Research output: Other contribution, peer-review

Abstract

Historical newspapers are increasingly accessed digitally for
different purposes by both professional and lay users. These ever-growing historical collections are usually formed by utilizing
Optical Character Recognition (OCR), which may introduce noise
to the texts. This subsequently leads to compromised information
retrieval (IR) performance and user understanding. The effect of
OCR noise on IR performance has been studied earlier by utilizing
artificially degraded OCR quality texts (see, e.g., [2, 15]), a test
collection containing documents with authentic low OCR quality
[12], or by gathering end-user impressions [23]. However, it
remains challenging to measure how the user’s subjective
perception is affected by the amount of OCR noise remaining in the
documents. Recently, the National Library of Finland has set up an
experimental system which allows studying this issue. The system
allows presenting each underlying historical document as two
alternatives – either based on the baseline OCR quality, or on the
new, improved OCR quality. This setup facilitates studying the
effects of OCR quality changes on the user’s subjective perception
of the document.
Following Gäde et al. [8], we describe in this paper the research
design, infrastructure, and research data utilized in a recent user
experiment by Kettunen et al. [19], entailing thirty-two test subjects
performing simulated work tasks [4], and discuss the prospects of
reusing the experimental components of the study. So far, the
system has been used in one experiment in which the subjects
performed simulated tasks. However, the research design and its
general model could be utilized in the future to study the effects of
OCR quality in professional settings entailing historians
performing naturalistic phases of their research tasks.
CCS CONCEPTS
Document representation; Users and interactive retrieval; Evaluation
of retrieval results; Task models; Search interfaces
KEYWORDS
OCR quality, Interactive Information Retrieval, Evaluation,
Simulated Work Task, Historical newspaper collections, User
Study, Resource reuse
ACM Reference format:
Kimmo Kettunen, Heikki Keskustalo, Birger Larsen, Tuula Pääkkönen and
Juha Rautiainen. 2022. Reusing the Model and Components of an IIR Study
for Perceived Effects of OCR Quality Change. In Proceedings of Third
Workshop on Building towards Information Interaction and Retrieval
Resources Re-use (BIIRR 2022). ACM, New York, NY, USA, 7 pages.
https://doi.org/XXX
1 Introduction
It is well known that OCR noise present in digitized historical
documents disturbs end-user perception of the documents. However,
how this noise impinges on the desired access is difficult to study [6, 27]. In
this paper, we describe a research design intended to allow studying
this issue and discuss the model and its components from the point
of view of reuse in research.
Digitized historical newspaper collections are produced
increasingly in different parts of the world and their usage is
expected to increase in the future. Access to information in these
collections is valuable to various stakeholders such as professional
historians, journalists, teachers, and ordinary citizens. To create
Original language: English
Number of pages: 7
Publication status: Published - May 2022
Publication type: Not Eligible
