tessdata: Tesseract Training Data
In cpp11tesseract: Open Source OCR Engine

tesseract_download

R Documentation

Tesseract Training Data

Description

`tesseract_download()` downloads training data from the official tessdata repository. On Linux, the fast training data can be installed directly with yum or apt-get.

`tesseract_contributed_download()` downloads training data from the contributed tessdata_contrib repository.

Usage

tesseract_download(
  lang,
  model = c("fast", "best"),
  datapath = NULL,
  progress = interactive()
)

tesseract_contributed_download(
  lang,
  model = c("fast", "best"),
  datapath = NULL,
  progress = interactive()
)

Arguments

`lang`	language code, usually three letters (for example `fra`); see the tessdata repository.
`model`	either `fast` or `best`. The latter downloads more accurate (but slower) trained models for Tesseract 4.0 or higher.
`datapath`	destination directory in which to store the downloaded file
`progress`	print progress while downloading
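
The `model = c("fast", "best")` default in the usage above looks like the common R idiom in which the first element of the vector is used when the argument is omitted. The sketch below illustrates that idiom with base R's `match.arg()`. It assumes, without confirmation from the package source, that the argument is resolved this way; `pick_model()` is a hypothetical stand-in, not a cpp11tesseract function.

```r
# Hypothetical stand-in for illustration only (not part of cpp11tesseract).
# match.arg() returns the first choice when the argument is left at its
# default and validates any value supplied by the caller.
pick_model <- function(model = c("fast", "best")) {
  match.arg(model)
}

pick_model()        # "fast"
pick_model("best")  # "best"
```

Under this assumption, omitting `model` would request the `fast` files, and a value outside the two choices would raise an error.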

Details

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages, you can install the training data from your distribution. For example, to install the Spanish training data:

tesseract-ocr-spa (Debian, Ubuntu)
tesseract-langpack-spa (Fedora, EPEL)

On Windows and macOS you can install languages using the tesseract_download function, which downloads training data directly from GitHub and stores it in the path on disk given by the TESSDATA_PREFIX environment variable.
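
To avoid repeated downloads, it can help to check whether the training data is already on disk before calling `tesseract_download()`. The following is a minimal sketch, assuming that training data is stored as `<lang>.traineddata` inside the data directory. `has_traineddata()` and `ensure_traineddata()` are hypothetical helpers written for this illustration, not functions exported by cpp11tesseract.

```r
# Hypothetical helper: TRUE when <datapath>/<lang>.traineddata exists.
# Assumes the usual "<lang>.traineddata" file naming.
has_traineddata <- function(lang, datapath = Sys.getenv("TESSDATA_PREFIX")) {
  nzchar(datapath) &&
    file.exists(file.path(datapath, paste0(lang, ".traineddata")))
}

# Hypothetical wrapper: download only when the file is missing.
# The download call is only reached if the file is absent.
ensure_traineddata <- function(lang, datapath = tempdir(), model = "fast") {
  if (!has_traineddata(lang, datapath)) {
    cpp11tesseract::tesseract_download(
      lang,
      model = model,
      datapath = datapath,
      progress = FALSE
    )
  }
  invisible(datapath)
}
```

A call such as `ensure_traineddata("spa")` would then fetch the Spanish data into the session's temporary directory only on first use.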

Value

No return value; called for side effects.

References

tesseract wiki: training data

Examples

# download the French training data
# this is wrapped in a \donttest{} block because otherwise the clang19
# CRAN check will fail with a "> 5 seconds" message

 dir <- tempdir()
 tesseract_download("fra", model = "best", datapath = dir)
 file <- system.file("examples", "french.png", package = "cpp11tesseract")
 text <- ocr(file, engine = tesseract("fra", datapath = dir))
 cat(text)

# download the Greek training data
# this is wrapped in a \donttest{} block because otherwise the clang19
# CRAN check will fail with a "> 5 seconds" message

 dir <- tempdir()
 tesseract_contributed_download("grc_hist", model = "best", datapath = dir)
 file <- system.file("examples", "polytonicgreek.png",
   package = "cpp11tesseract")
 text <- ocr(file, engine = tesseract("grc_hist", datapath = dir))
 cat(text)

cpp11tesseract documentation built on April 4, 2025, 5:24 a.m.