Abstract
Purpose: The Gleason score is an important grading factor of prostate cancer. Gleason scores can be extracted from pathology report texts using regular expressions, but previously developed programmes have targeted only relatively simple Gleason score expressions. We developed a programme that can also extract complex expressions. The programme is relatively easy to adapt to other languages and datasets. Methods: We developed and evaluated our regular expression-based programme using manually processed pathology reports of prostate cancer cases diagnosed in Finland in 2016–2017. Both simple and complex Gleason score expressions were targeted. We measured the performance of our programme using recall, precision, and the F1 score. The proportion of complex Gleason score expressions was estimated as the complement of the recall obtained when only addition expressions (e.g. “Gleason 3 + 4”) were targeted. Results: The detection of values (scores and score components) is based on mandatory keywords before or after the value. The programme favours precision over recall by primarily allowing only listed optional expressions between a keyword and its value and only secondarily allowing arbitrary expressions. The programme is straightforward to adapt to new datasets by modifying the lists of mandatory and optional expressions. The full and addition-only programmes had 92% (95% CI: [90%, 95%]) and 65% ([61%, 70%]) recall and high precision (98% [97%, 99%] and 100% [99%, 100%]), respectively. The estimated proportion of complex Gleason score expressions was 100% − 65% = 35%. Conclusions: Even complex Gleason score expressions can be extracted with high recall and precision using regular expressions. We recommend implementing automated Gleason score extraction where possible by adapting our validated programme.
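The keyword-value idea described in the Results can be illustrated with a minimal Python sketch. It is not the published programme: the English keyword, the list of optional expressions, the 30-character limit on arbitrary text, and all names (`KEYWORD`, `OPTIONAL`, `ARBITRARY`, `extract_gleason`) are assumptions made for illustration. The sketch first tries patterns that allow only listed optional expressions between the keyword and the value, and only then falls back to a pattern that tolerates a short run of arbitrary non-digit text.

```python
import re

# Mandatory keyword that must precede the value (hypothetical, English only).
KEYWORD = r"gleason(?:\s+score)?"
# Listed optional expressions allowed between keyword and value (hypothetical list).
OPTIONAL = r"(?:\s*(?:grade|sum|pattern|is|was|[:=(]))*\s*"
# Fallback: a short run of arbitrary non-digit text (the length limit is an assumption).
ARBITRARY = r"\D{0,30}?"

# Addition expression such as "3 + 4": primary (a) and secondary (b) components.
ADDITION = r"(?P<a>[1-5])\s*\+\s*(?P<b>[1-5])"
# Score alone, e.g. "7"; must not be the start of an addition expression.
SCORE_ONLY = r"(?P<c>10|[2-9])\b(?!\s*\+)"

# Strict patterns come first, so the loose fallback is used only when they fail.
PATTERNS = [
    KEYWORD + OPTIONAL + ADDITION,
    KEYWORD + OPTIONAL + SCORE_ONLY,
    KEYWORD + ARBITRARY + ADDITION,
]


def extract_gleason(text):
    """Return the named values of the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return {name: int(value) for name, value in match.groupdict().items()}
    return None


print(extract_gleason("Gleason 3 + 4"))
print(extract_gleason("Gleason score: 7"))
print(extract_gleason("Gleason in the left lobe 4+5"))
```

Under these assumptions, adapting the sketch to another language or dataset would mean editing `KEYWORD` and `OPTIONAL` rather than the matching logic, which mirrors the adaptation route the abstract describes.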
Original language | English |
---|---|
Article number | 103850 |
Journal | Journal of Biomedical Informatics |
Volume | 120 |
DOIs | |
Publication status | Published - Aug 2021 |
Publication type | A1 Journal article-refereed |
Keywords
- Free-form text
- Gleason score
- Information extraction
- Natural language processing
- Pathology report
- Regular expression
Publication forum classification
- Publication forum level 1
ASJC Scopus subject areas
- Computer Science Applications
- Health Informatics