jacob-ogre/ocrerrors: Find Optical Character Recognition Errors and Corrections

Optical Character Recognition (OCR) works well when the input file is a clean, high-resolution image or document. But many documents exist in digital form only as (often old) low-resolution or "messy" PDFs or images. Pre-processing images can help improve OCR accuracy, but those steps are often of limited utility. Post-processing OCR output can substantially improve accuracy, and this processing can be informed by the distribution of word frequencies in a corpus and by the errors that commonly occur. This package contains a set of tools for getting the distributions of n-grams from a 'gold standard' set of input PDFs with a text layer; simulating low-quality, image-based PDFs from the gold set; and identifying errors that arise from OCR of the low-quality, image-based PDFs.
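To make the idea of n-gram distributions and error candidates concrete, here is a minimal base-R sketch. The function names `tokenize`, `count_ngrams`, and `candidate_errors` are hypothetical and are not taken from the ocrerrors package; the tokenizer is deliberately crude (lowercase ASCII letters only) and the package's own approach may differ.

```r
# Hypothetical helpers for illustration; NOT part of the ocrerrors API.

# Split text into lowercase word tokens (ASCII letters only, an assumption).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words)]
}

# Count word n-grams in a text, most frequent first.
count_ngrams <- function(text, n = 2) {
  words <- tokenize(text)
  if (length(words) < n) return(table(character(0)))
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

# Tokens seen in OCR output but absent from the gold-standard vocabulary.
candidate_errors <- function(ocr_text, gold_text) {
  setdiff(tokenize(ocr_text), tokenize(gold_text))
}

gold <- "the cat sat on the mat and the cat ran"
ocr  <- "teh cat sat on the mat and the cat ran"

bigrams <- count_ngrams(gold, n = 2)
suspects <- candidate_errors(ocr, gold)
```

In practice the gold vocabulary would come from a whole corpus rather than a single string, and flagged tokens are only candidates: a word missing from the gold set is not necessarily an OCR error.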


Package details

License: BSD_2_clause + file LICENSE
URL: https://github.com/jacob-ogre/ocrerrors
Installation: Install the latest version of this package by entering the following in R:
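One common way to install a package hosted on GitHub is via the `remotes` package. This is a sketch under that assumption; the repository path is taken from the URL above, and the exact command the page originally showed may differ.

```r
# Assumes the 'remotes' package is an acceptable helper (an assumption).
install.packages("remotes")

# Install ocrerrors from its GitHub repository.
remotes::install_github("jacob-ogre/ocrerrors")
```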
jacob-ogre/ocrerrors documentation built on May 18, 2019, 8:01 a.m.