knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

rmgarbage: automatic removal of garbage strings in OCR text

R build status Lifecycle: experimental

The goal of rmgarbage is to remove garbage strings from text produced by OCR engines. It contains functions that implement the methods described by:

The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.

Installation

You can install rmgarbage from GitHub with:

# install.packages("remotes")  # run first if remotes is not installed
remotes::install_github("benmarwick/rmgarbage")

Example

This basic example shows how to identify bad OCR output.

library(rmgarbage)

Here is the output for text that was recognized cleanly:

good_ocr <- "This document was scanned perfectly"
# Split the text into individual words on spaces
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
# Apply rmgarbage() to each word
sapply(good_ocr_split, rmgarbage)

And here is the output for text that was recognized poorly:

bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
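
To give a feel for what rule-based garbage detection can look like, here is a minimal, self-contained sketch in base R. It is not the package's implementation: the helper `looks_like_garbage()`, its `max_run` parameter, and the three heuristics with their thresholds are all assumptions made for illustration, and it does not call `rmgarbage()` at all.

```r
# Hypothetical helper, NOT part of rmgarbage: flags a single token as likely
# OCR garbage using three simple, assumed heuristics.
looks_like_garbage <- function(word, max_run = 4L) {
  chars <- strsplit(word, "", fixed = TRUE)[[1]]
  if (length(chars) == 0L) return(FALSE)

  # Heuristic 1: the same character repeated max_run or more times in a row.
  longest_run <- max(rle(chars)$lengths)

  # Heuristic 2: more non-alphanumeric characters than alphanumeric ones.
  n_alnum <- sum(grepl("^[[:alnum:]]$", chars))
  n_other <- length(chars) - n_alnum

  # Heuristic 3: two or more distinct punctuation characters, ignoring the
  # first and last character of the token.
  inner <- if (length(chars) > 2L) chars[-c(1L, length(chars))] else character(0)
  n_punct <- length(unique(inner[grepl("^[[:punct:]]$", inner)]))

  longest_run >= max_run || n_other > n_alnum || n_punct >= 2L
}

words <- strsplit("This 3ccm@nt w&s scnnnnd not pe&;c1!y", " ")[[1]]

# One TRUE/FALSE flag per token.
flags <- vapply(words, looks_like_garbage, logical(1))

# Keep only the tokens that were not flagged.
clean <- words[!flags]
```

With these assumed rules, mildly damaged tokens such as "w&s" pass through unflagged, which shows why a fuller rule set is needed in practice.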

Contributing

If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



benmarwick/rmgarbage documentation built on April 19, 2020, 6:06 p.m.