check_embed: Check if text embed is not from OCR
In jacob-ogre/ocrerrors: Find Optical Character Recognition Errors and Corrections


View source: R/gold_std.R

Some PDFs have an embedded text layer that is derived from OCR by the scanner or other equipment that produced the PDF. Such documents will likely have OCR artifacts that will contaminate the 'gold standard' that is needed for error correction. The gold standard texts should only come from PDFs derived directly from the original document (e.g., .docx).

check_embed(file)

file	Path to a PDF to check for embedding source

Logical: TRUE if the embedded text is good (not derived from OCR), FALSE if it comes from OCR

pdftools::pdf_info

# res <- check_embed("test.pdf")
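The page does not show how the check is made, so the following is only a rough sketch of one way such a test could work, assuming the PDF metadata returned by pdftools::pdf_info is inspected. The helper names looks_like_ocr and sketch_check_embed and the list of patterns are hypothetical and are not part of ocrerrors; the actual implementation in R/gold_std.R may differ.

```r
# Hypothetical helper (not from ocrerrors): flag metadata strings that
# suggest a scanner or OCR engine produced the PDF.
# `keys` is a named list such as the `keys` element of pdftools::pdf_info().
# The default patterns are illustrative assumptions, not an exhaustive list.
looks_like_ocr <- function(keys,
                           patterns = c("ocr", "scan", "tesseract", "abbyy")) {
  # Producer and Creator are standard entries in the PDF info dictionary.
  fields <- unlist(keys[c("Producer", "Creator")], use.names = FALSE)
  if (length(fields) == 0) {
    # No metadata to inspect, so nothing points to OCR.
    return(FALSE)
  }
  regex <- paste(patterns, collapse = "|")
  any(grepl(regex, fields, ignore.case = TRUE))
}

# Hypothetical wrapper mirroring the documented return value:
# TRUE if the embed looks good, FALSE if it looks OCR-derived.
# Requires the pdftools package when called.
sketch_check_embed <- function(file) {
  info <- pdftools::pdf_info(file)
  !looks_like_ocr(info$keys)
}

# Example on metadata alone, with no PDF needed:
looks_like_ocr(list(Producer = "ABBYY FineReader 11"))
looks_like_ocr(list(Producer = "Microsoft Word 2016"))
```

A metadata heuristic like this can miss OCR layers added by tools that leave no trace in Producer or Creator, so a TRUE result is best treated as "no evidence of OCR" rather than proof of a clean embed.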

jacob-ogre/ocrerrors documentation built on May 18, 2019, 8:01 a.m.
