Customised OCR Correction for Historical Medical Text

dc.contributor.author: Thompson, Paul [en_US]
dc.contributor.author: Mcnaught, John [en_US]
dc.contributor.author: Ananiadou, Sophia [en_US]
dc.contributor.editor: Gabriele Guidi and Roberto Scopigno and Fabio Remondino [en_US]
dc.date.accessioned: 2016-01-06T08:14:10Z
dc.date.available: 2016-01-06T08:14:10Z
dc.date.issued: 2015 [en_US]
dc.description.abstract: Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, owing to large-scale digitisation efforts. Searchable access is typically provided by applying Optical Character Recognition (OCR) software to scanned page images. Often, however, the automatically recognised text contains a large number of errors, since OCR systems are typically optimised to deal with modern documents, and can struggle with historical document features, including variable print characteristics and archaic vocabulary usage. Low quality OCR text can reduce the efficiency of search systems over historical archives, particularly semantic systems that are based on the application of sophisticated text mining (TM) techniques. We report on a new OCR correction strategy, customised for historical medical documents. The method combines rule-based correction of regular errors with a medically-tuned spellchecking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected. The performance of our method compares favourably to other OCR post-correction strategies, in improving word-level accuracy of poor-quality documents by up to 16%. [en_US]
dc.description.sectionheaders: Full Papers [en_US]
dc.description.seriesinformation: International Congress on Digital Heritage - Theme 1 - Digitization And Acquisition [en_US]
dc.identifier.doi: 10.1109/DigitalHeritage.2015.7413829 [en_US]
dc.identifier.isbn: 978-1-5090-0048-7 [en_US]
dc.identifier.uri: https://doi.org/10.1109/DigitalHeritage.2015.7413829 [en_US]
dc.publisher: IEEE [en_US]
dc.subject: OCR correction [en_US]
dc.subject: medical history [en_US]
dc.subject: historical document search [en_US]
dc.subject: spell checking [en_US]
dc.subject: text mining [en_US]
dc.title: Customised OCR Correction for Historical Medical Text [en_US]