Accurate pattern-based extraction of complex Gleason score expressions from pathology reports

Joonas Miettinen, Tomas Tanskanen, Henna Degerlund, Aapeli Nevala, Nea Malila, Janne Pitkäniemi

Research output: Contribution to journal › Article › Scientific › peer-review


Purpose: The Gleason score is an important grading factor of prostate cancer. Gleason scores can be extracted from pathology report texts using regular expressions, but previously developed programmes have targeted only relatively simple Gleason score expressions. We developed a programme that can also extract complex expressions. The programme is relatively easy to adapt to other languages and datasets.

Methods: We developed and evaluated our regular expression-based programme using manually processed pathology reports of prostate cancer cases diagnosed in Finland in 2016–2017. Both simple and complex Gleason score expressions were targeted. We measured the performance of our programme using recall, precision, and the F1 score. The proportion of complex Gleason score expressions was estimated as the complement of the recall when only addition expressions (e.g. “Gleason 3 + 4”) were targeted.

Results: The detection of values (scores and score components) is based on mandatory keywords before or after the value. The programme favours precision over recall by primarily allowing for lists of optional expressions between keyword-value pairs and only secondarily allowing for arbitrary expressions. The programme is straightforward to adapt to new datasets by modifying the lists of mandatory and optional expressions. The full and addition-only programmes had 92% (95% CI: [90%, 95%]) and 65% (95% CI: [61%, 70%]) recall, respectively, and high precision (98% [97%, 99%] and 100% [99%, 100%], respectively). The estimated proportion of complex Gleason score expressions was 100% − 65% = 35%.

Conclusions: Even complex Gleason score expressions can be extracted with high recall and precision using regular expressions. We recommend implementing automated Gleason score extraction where possible by adapting our validated programme.
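
The keyword-based detection described in the Results can be illustrated with a minimal Python sketch. It is not the authors' programme: the word lists `KEYWORDS` and `OPTIONAL`, the function names, and the pattern itself are hypothetical and chosen only for illustration. The sketch requires a mandatory keyword before the value, allows only listed optional expressions between keyword and value, and handles addition expressions and plain scores; keywords after the value, arbitrary intervening text, and other complex expressions are not covered.

```python
import re

# Hypothetical word lists for illustration only; they are NOT the lists
# used in the published programme.
KEYWORDS = ["gleason"]                           # mandatory keyword before the value
OPTIONAL = ["score", "grade", "sum", "is", ":"]  # allowed filler between keyword and value


def build_pattern(keywords, optional):
    """Compile a pattern: keyword, optional filler, then 'A + B [= C]' or a plain score."""
    kw = "|".join(re.escape(k) for k in keywords)
    opt = "|".join(re.escape(o) for o in optional)
    return re.compile(
        rf"\b(?:{kw})\b"                      # mandatory keyword
        rf"(?:\s*(?:{opt}))*\s*"              # zero or more listed optional expressions
        rf"(?:(?P<a>[1-5])\s*\+\s*(?P<b>[1-5])"        # components (Gleason patterns 1-5)
        rf"(?:\s*=\s*(?P<c>\d{{1,2}}))?"               # optional stated sum
        rf"|(?P<total>10|[2-9])\b)",                   # plain score (2-10)
        re.IGNORECASE,
    )


PATTERN = build_pattern(KEYWORDS, OPTIONAL)


def extract(text, pattern=PATTERN):
    """Return one dict per match with components a, b (if present) and the total."""
    results = []
    for m in pattern.finditer(text):
        if m.group("a"):
            a, b = int(m.group("a")), int(m.group("b"))
            # Use the stated sum when present; otherwise derive it as a + b
            # (an assumption of this sketch).
            total = int(m.group("c")) if m.group("c") else a + b
            results.append({"a": a, "b": b, "total": total})
        else:
            results.append({"a": None, "b": None, "total": int(m.group("total"))})
    return results


if __name__ == "__main__":
    print(extract("Prostate biopsy: Gleason score 3 + 4 = 7."))
    print(extract("gleason grade: 8"))
    print(extract("PSA 7 ng/ml"))  # no keyword, so nothing is extracted
```

Restricting the text between keyword and value to a fixed list is what makes such a pattern favour precision over recall; under this design, adapting it to another language or dataset would mean editing the two lists rather than the pattern.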

Original language: English
Article number: 103850
Journal: Journal of Biomedical Informatics
Publication status: Published - Aug 2021
Publication type: A1 Journal article-refereed

Keywords

  • Free-form text
  • Gleason score
  • Information extraction
  • Natural language processing
  • Pathology report
  • Regular expression

Publication forum classification

  • Publication forum level 1

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

