De-duplication and error correction of OCR-processed index cards using optimized string metrics

Thesis Type Master
Thesis Status
Student Markus Ruepp
Thesis Supervisor

This thesis will discuss methods for deduplication and error correction of OCR-generated texts.
For this purpose, several state-of-the-art string metrics will be analyzed and evaluated. Based on this evaluation, a new approach to string comparison will be developed. In a further step, a prototype will be implemented for the University Library of Innsbruck
to deduplicate and correct errors in millions of typewritten index cards.
To this end, queries generated from the index cards are issued against the online library catalogue WorldCat. The retrieved search results are compared against the cards, and the best match is then used to correct the index card.
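
The comparison step described above can be illustrated with a minimal sketch. It is not the string comparison approach developed in the thesis: plain Levenshtein distance stands in as one well-known string metric, and the function names, the normalization, and the threshold of 0.8 are assumptions made purely for illustration. The catalogue results are passed in as a list of strings; no actual WorldCat query is performed.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical (case-insensitive)."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(card_text: str,
               candidates: Sequence[str],
               threshold: float = 0.8) -> Optional[str]:
    """Return the most similar catalogue record, or None if none is close enough.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    if not candidates:
        return None
    score, candidate = max(
        ((similarity(card_text, c), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    return candidate if score >= threshold else None


if __name__ == "__main__":
    ocr_card = "Goethe, Faust. Eine Tragodie"  # OCR dropped the umlaut
    results = ["Goethe, Faust. Eine Tragödie", "Schiller, Die Räuber"]
    print(best_match(ocr_card, results))
```

Returning None when no candidate reaches the threshold keeps a card unchanged rather than "correcting" it with an unrelated record; where that cut-off should lie would have to be determined empirically.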