tesseract: Tesseract OCR

Description

Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text.

Usage

ocr(image, engine = tesseract("eng"))

tesseract(language = NULL, datapath = NULL, options = NULL,
  cache = TRUE)

Arguments

image

file path, url, or raw vector to image (png, tiff, jpeg, etc)

engine

a tesseract engine created with tesseract()

language

string with language for training data. Usually defaults to eng

datapath

path with the training data for this language. Default uses the system library.

options

a named list with tesseract engine options

cache

use a cached version of this training data if available

Details

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, the Spanish training data is packaged as tesseract-ocr-spa on Debian and Ubuntu and as tesseract-langpack-spa on Fedora.

On Windows and macOS you can install languages using the tesseract_download function, which downloads training data directly from GitHub and stores it in the path on disk given by the TESSDATA_PREFIX variable.
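
A minimal sketch of downloading a language and using it for OCR. It assumes the language code "spa" for Spanish and a hypothetical local image file spanish_scan.png; neither is taken from this page. The calls are guarded so the snippet does nothing when the package or the file is missing.

```r
# Language code for the Spanish training data (assumption: "spa")
lang <- "spa"
# Hypothetical input image; replace with your own file
img <- "spanish_scan.png"

if (requireNamespace("tesseract", quietly = TRUE) && file.exists(img)) {
  # Download the training data (intended for Windows and macOS)
  tesseract::tesseract_download(lang)
  # Create an engine for that language and pass it to ocr()
  spanish <- tesseract::tesseract(language = lang)
  text <- tesseract::ocr(img, engine = spanish)
  cat(text)
}
```

On Linux the download step would normally be replaced by installing the training data package from the distribution.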

References

Tesseract training data

See Also

Other tesseract: tesseract_download

Examples

# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

## Not run: 
# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render pdf to tiff image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)

# Extract text from tiff image
text <- ocr(img_file)
unlink(img_file)
cat(text)

## End(Not run)

# Engine restricted to recognizing digits only
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
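
The engine created above is not used by itself; it has to be passed to ocr() through the engine argument. A short sketch of that step, assuming a hypothetical image file receipt.png that contains digits (the file name is illustrative only):

```r
# Hypothetical image containing digits; replace with your own file
img <- "receipt.png"
# Engine options are a named list; here recognition is limited to digits
digit_options <- list(tessedit_char_whitelist = "0123456789")

if (requireNamespace("tesseract", quietly = TRUE) && file.exists(img)) {
  # Build the engine from the options and use it for this image
  engine <- tesseract::tesseract(options = digit_options)
  digits <- tesseract::ocr(img, engine = engine)
  cat(digits)
}
```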

tesseract documentation built on Aug. 15, 2017, 1:03 a.m.