knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The goal of rmgarbage is to identify and remove garbage strings produced by OCR engines. It contains functions that implement the methods described by:
The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.
You can install rmgarbage from GitHub with:
remotes::install_github("benmarwick/rmgarbage")
This is a basic example that shows how to identify bad OCR output.
library(rmgarbage)
Here is an example of the output on good OCR:
good_ocr <- "This document was scanned perfectly"
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
sapply(good_ocr_split, rmgarbage)
And here is an example of the output on bad OCR:
bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
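The per-token results can also be used to filter a string. The following is a minimal sketch, not part of the package: `keep_clean_tokens()` is a hypothetical helper, and it assumes the predicate passed to it (for example `rmgarbage`) returns a single `TRUE` for a garbage token and `FALSE` otherwise. The stand-in predicate `has_non_letter()` is for illustration only and is not the rule set that rmgarbage implements.

```r
# Hypothetical helper (not part of rmgarbage): drop tokens flagged as garbage.
# `is_garbage` is any function that returns TRUE for a garbage token.
# Passing rmgarbage here assumes it returns one logical value per string.
keep_clean_tokens <- function(text, is_garbage) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  flagged <- vapply(tokens, is_garbage, logical(1), USE.NAMES = FALSE)
  paste(tokens[!flagged], collapse = " ")
}

# Stand-in predicate for illustration only: flags any token that contains
# a character other than an ASCII letter. This is NOT the package's logic.
has_non_letter <- function(x) grepl("[^A-Za-z]", x)

keep_clean_tokens("This 3ccm@nt w&s scnnnnd not pe&;c1!y", has_non_letter)
#> [1] "This scnnnnd not"
```

With the package loaded, the same helper could be called as `keep_clean_tokens(bad_ocr, rmgarbage)`, under the assumption stated above about the return value. Taking the predicate as an argument keeps the filtering step independent of which classifier is used.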
If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.