ocr | R Documentation |
Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text. See tesseract wiki and the package vignette for image preprocessing tips.
ocr(file, engine = tesseract("eng"), HOCR = FALSE, opw = "", upw = "")
ocr_data(file, engine = tesseract("eng"))
file |
file path or raw vector (png, tiff, jpeg, etc). |
engine |
a tesseract engine created with tesseract() |
HOCR |
if TRUE, return hOCR text instead of plain text |
opw |
owner password to open pdf (please pass it as an environment variable to avoid leaking sensitive information) |
upw |
user password to open pdf (please pass it as an environment variable to avoid leaking sensitive information) |
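A minimal sketch of the environment-variable advice above, so the password never appears in a script or in the R history. The variable name PDF_OWNER_PASSWORD and the file name protected.pdf are hypothetical placeholders, not part of the package.

```r
# Hypothetical names: set PDF_OWNER_PASSWORD outside of R (e.g. in ~/.Renviron)
# and point pdf_path at your own protected file.
pdf_path <- "protected.pdf"
owner_pw <- Sys.getenv("PDF_OWNER_PASSWORD", unset = "")

# Only run when the package is installed and the placeholder file exists.
if (requireNamespace("cpp11tesseract", quietly = TRUE) && file.exists(pdf_path)) {
  text <- cpp11tesseract::ocr(pdf_path, opw = owner_pw)
  cat(text, sep = "\n")
}
```

The same pattern applies to upw when the file is protected by a user password.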
The ocr() function returns plain text by default, or hOCR text if HOCR is set to TRUE.
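As an illustration of the two output modes, the sketch below requests hOCR instead of plain text for the example image shipped with the package. It only assumes the HOCR argument shown in the usage above; printing the first 300 characters is an arbitrary choice to keep the output short.

```r
hocr <- NULL

# Guarded so the snippet is harmless when the package is not installed.
if (requireNamespace("cpp11tesseract", quietly = TRUE)) {
  file <- system.file("examples", "test.png", package = "cpp11tesseract")
  hocr <- cpp11tesseract::ocr(file, HOCR = TRUE)
  # hOCR is markup, so show only the beginning of it.
  cat(substr(hocr, 1, 300), "\n")
}
```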
The ocr_data() function returns a data frame with a confidence rate and bounding box for each word in the text.
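One possible use of the per-word confidence is to drop words the engine was unsure about. The sketch below assumes the data frame has columns named word and confidence, with confidence on a 0 to 100 scale. These names are not stated on this page, so check names(words) first. The helper keep_confident() and the threshold of 80 are hypothetical.

```r
# Hypothetical helper: keep only rows whose confidence meets a threshold.
# Assumes a numeric `confidence` column (assumed scale: 0 to 100).
keep_confident <- function(words, min_conf = 80) {
  words[words$confidence >= min_conf, , drop = FALSE]
}

# Guarded so the snippet is harmless when the package is not installed.
if (requireNamespace("cpp11tesseract", quietly = TRUE)) {
  file <- system.file("examples", "test.png", package = "cpp11tesseract")
  words <- cpp11tesseract::ocr_data(file)
  print(names(words))           # verify the assumed column names
  print(keep_confident(words))
}
```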
A character vector of text extracted from the file. If the file has a TIFF or PDF extension, the vector has one element per page.
Other tesseract: tesseract(), tesseract_download()
file <- system.file("examples", "test.png", package = "cpp11tesseract")
text <- ocr(file)
cat(text)