Customised OCR Correction for Historical Medical Text

Thompson, Paul; McNaught, John; Ananiadou, Sophia

dc.contributor.author	Thompson, Paul	en_US
dc.contributor.author	McNaught, John	en_US
dc.contributor.author	Ananiadou, Sophia	en_US
dc.contributor.editor	Gabriele Guidi and Roberto Scopigno and Fabio Remondino	en_US
dc.date.accessioned	2016-01-06T08:14:10Z
dc.date.available	2016-01-06T08:14:10Z
dc.date.issued	2015	en_US
dc.description.abstract	Historical text archives constitute a rich and diverse source of information, which is becoming increasingly accessible owing to large-scale digitisation efforts. Searchable access is typically provided by applying Optical Character Recognition (OCR) software to scanned page images. Often, however, the automatically recognised text contains a large number of errors, since OCR systems are typically optimised for modern documents and can struggle with features of historical documents, including variable print characteristics and archaic vocabulary. Low-quality OCR text can reduce the efficiency of search systems over historical archives, particularly semantic systems that are based on sophisticated text mining (TM) techniques. We report on a new OCR correction strategy, customised for historical medical documents. The method combines rule-based correction of regular errors with a medically tuned spellchecking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected. The performance of our method compares favourably with that of other OCR post-correction strategies, improving the word-level accuracy of poor-quality documents by up to 16%.	en_US
dc.description.sectionheaders	Full Papers	en_US
dc.description.seriesinformation	International Congress on Digital Heritage - Theme 1 - Digitization And Acquisition	en_US
dc.identifier.doi	10.1109/DigitalHeritage.2015.7413829	en_US
dc.identifier.isbn	978-1-5090-0048-7	en_US
dc.identifier.uri	https://doi.org/10.1109/DigitalHeritage.2015.7413829	en_US
dc.publisher	IEEE	en_US
dc.subject	OCR correction	en_US
dc.subject	medical history	en_US
dc.subject	historical document search	en_US
dc.subject	spell checking	en_US
dc.subject	text mining	en_US
dc.title	Customised OCR Correction for Historical Medical Text	en_US

Collections

DH2015 - Track 1