knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

rmgarbage: automatic removal of garbage strings in OCR text

The goal of rmgarbage is to remove garbage strings from the output of OCR engines. It contains functions that implement the methods described by:

Taghva et al. (2001) "Automatic Removal of Garbage Strings in OCR Text: An implementation" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.8901
Yang Cai (2008) "OCR Output Enhancement" https://ladyissy.github.io/OCR/

The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.
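To give a flavour of the kind of heuristics described by Taghva et al. (2001), here is a minimal sketch in base R. It is not the package's implementation: the function name `looks_like_garbage`, the `max_length` argument, and the choice of only three rules are assumptions made for illustration.

```r
# Hypothetical helper, NOT part of rmgarbage: three illustrative heuristics
# in the spirit of Taghva et al. (2001), written with base R only.
looks_like_garbage <- function(x, max_length = 40L) {
  # Rule A: very long strings are unlikely to be real words.
  too_long <- nchar(x) > max_length
  # Rule B: the same character three or more times in a row.
  repeated <- grepl("(.)\\1{2,}", x, perl = TRUE)
  # Rule C: more punctuation characters than alphanumeric characters.
  n_punct <- nchar(gsub("[^[:punct:]]", "", x))
  n_alnum <- nchar(gsub("[^[:alnum:]]", "", x))
  more_punct <- n_punct > n_alnum
  too_long | repeated | more_punct
}

tokens <- strsplit("This 3ccm@nt w&s scnnnnd not pe&;c1!y", " ")[[1]]
flags <- looks_like_garbage(tokens)
tokens[flags]
```

With only these three rules, a token such as "scnnnnd" is flagged because of the repeated "n", while other damaged tokens slip through, which is why the published method combines a larger set of rules.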

Installation

You can install rmgarbage from GitHub with:

remotes::install_github("benmarwick/rmgarbage")

Example

This basic example shows how to identify bad OCR output.

library(rmgarbage)

Here is an example of the output for good OCR text:

good_ocr <- "This document was scanned perfectly"
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
sapply(good_ocr_split, rmgarbage)

And here is an example of the output for bad OCR text:

bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
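Once each token has been classified, the flagged ones can be dropped and the remaining text reassembled. The sketch below shows one way to do that. The helper `drop_garbage` is hypothetical, and it assumes the predicate passed as `is_garbage` returns a single `TRUE` or `FALSE` per token; whether `rmgarbage()` behaves exactly this way is an assumption, so a toy stand-in predicate is used here to keep the example self-contained.

```r
# Hypothetical helper, NOT part of rmgarbage: keep only tokens that the
# supplied predicate does not flag as garbage.
drop_garbage <- function(text, is_garbage) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  # vapply enforces exactly one logical value per token.
  flags <- vapply(tokens, is_garbage, logical(1), USE.NAMES = FALSE)
  paste(tokens[!flags], collapse = " ")
}

# Toy stand-in predicate (assumption): flag any token that contains a
# non-alphanumeric character. Swap in the real classifier as appropriate.
toy_is_garbage <- function(x) grepl("[^[:alnum:]]", x)

cleaned <- drop_garbage("This 3ccm@nt w&s scnnnnd not pe&;c1!y", toy_is_garbage)
cleaned
```

Using `vapply` rather than `sapply` makes the helper fail early if the predicate returns anything other than one logical value per token.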

Contributing

If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

benmarwick/rmgarbage documentation built on April 19, 2020, 6:06 p.m.