jacob-ogre/ocrerrors: Find Optical Character Recognition Errors and Corrections

Optical Character Recognition (OCR) works well when the input file is a clean, high-resolution image or document. But many documents exist in digital form only as (often old) low-resolution or "messy" PDFs or images. Pre-processing images can improve OCR accuracy, but those steps are often of limited utility. Post-processing an OCR document can substantially improve accuracy, and this processing can be informed by determining the distribution of word frequencies in a corpus and by identifying common errors. This package contains a set of tools for getting the distributions of n-grams from a 'gold standard' set of input PDFs with a text layer; simulating low-quality, image-based PDFs from the gold set; and identifying errors that arise from OCR of those low-quality, image-based PDFs.
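The idea of comparing n-gram distributions from a gold-standard text against OCR output can be sketched in base R. This is a minimal illustration only: `ngram_counts()` is a hypothetical helper written for this example and is not a function from ocrerrors, and the two strings stand in for text extracted from a gold PDF and from its OCR'd counterpart.

```r
# Hypothetical helper (NOT part of ocrerrors): count word n-grams in text.
ngram_counts <- function(text, n = 1) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  # Build each n-gram by pasting n consecutive words together
  idx <- seq_len(length(words) - n + 1)
  grams <- vapply(
    idx,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

# Stand-ins for gold-standard text and the OCR output of a degraded copy
gold <- "the species is listed as endangered under the act"
ocr  <- "tbe species is listed as endangered under the act"

gold_freq <- ngram_counts(gold, n = 1)
ocr_words <- names(ngram_counts(ocr, n = 1))

# OCR tokens that never occur in the gold vocabulary are candidate errors
candidates <- setdiff(ocr_words, names(gold_freq))
print(candidates)
```

Here the only token flagged is "tbe", a plausible misreading of "the". On a real corpus the gold frequencies could also be used to rank likely corrections, though how the package itself does this is not described here.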


Package details
Maintainer
License	BSD_2_clause + file LICENSE
Version	0.1.0
URL	https://github.com/jacob-ogre/ocrerrors
Installation	Install the latest version of this package by entering the following in R: `install.packages("remotes")` followed by `remotes::install_github("jacob-ogre/ocrerrors")`