TessTools: Tools for the use of Tesseract OCR in R

TessTools provides an interface to the Tesseract OCR command line tool (version 4), together with parsing functions for the analysis of historical newspaper archives. The package is under development.

Installation

Make sure the tesseract command line program is installed and available in PATH. You can either install Tesseract from a pre-built binary package or build it from source. Running tesseract with no arguments should print its usage message:

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
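Whether tesseract is available can also be checked from R before calling any OCR function. The following is a minimal sketch using only base R; the helper name tesseract_available is hypothetical and is not part of TessTools.

```r
# Hypothetical helper (not part of TessTools): report whether the
# tesseract executable can be found on PATH.
tesseract_available <- function(cmd = "tesseract") {
  path <- unname(Sys.which(cmd))  # "" when the command is not found
  isTRUE(!is.na(path) && nzchar(path))
}

if (tesseract_available()) {
  # Print the installed version; TessTools targets Tesseract version 4.
  system2("tesseract", "--version")
} else {
  message("tesseract was not found on PATH; install it before using TessTools.")
}
```

If the check fails even though Tesseract is installed, the R session's PATH (see Sys.getenv("PATH")) may differ from the one used by your terminal.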

You can install the development version of TessTools from GitHub with:

# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")

Example

Download the first issue (1905) of the Duke Chronicle newspaper.

library(TessTools)

issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")

Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.

hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")

# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page
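To give an idea of what the hOCR files contain, here is a minimal sketch that pulls paragraph text and bounding boxes out of a small hand-written hOCR fragment with the xml2 package. It only illustrates the file format. It is not the implementation of paragraphs(), and the column names x0, y0, x1, y1 are assumptions chosen for this example.

```r
library(xml2)

# A tiny hand-written hOCR fragment, for illustration only.
hocr <- "<html><body><div class='ocr_page' title='bbox 0 0 800 600'>
  <p class='ocr_par' title='bbox 10 20 300 80'>
    <span class='ocr_line'><span class='ocrx_word'>Hello</span> <span class='ocrx_word'>world</span></span>
  </p>
  <p class='ocr_par' title='bbox 10 100 300 160'>
    <span class='ocr_line'><span class='ocrx_word'>Second</span> <span class='ocrx_word'>paragraph</span></span>
  </p>
</div></body></html>"

doc <- read_html(hocr)

# In hOCR, paragraphs are elements carrying the class "ocr_par".
pars <- xml_find_all(doc, "//p[contains(@class, 'ocr_par')]")

# The title attribute starts with "bbox x0 y0 x1 y1"; other properties
# may follow after a semicolon, so keep only the first field.
bbox_field <- sub(";.*$", "", xml_attr(pars, "title"))
coords <- regmatches(bbox_field, gregexpr("[0-9]+", bbox_field))
bbox <- do.call(rbind, lapply(coords, as.integer))
colnames(bbox) <- c("x0", "y0", "x1", "y1")

# Collapse runs of whitespace so each paragraph becomes a single string.
par_text <- trimws(gsub("\\s+", " ", xml_text(pars)))

result <- data.frame(text = par_text, bbox, stringsAsFactors = FALSE)
print(result)
```

The coordinates are in pixels of the scanned image, with the origin at the top-left corner of the page.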

Visualize the result using hocrjs:

webpages = visualize_html(hocrfiles, outputdir="data-raw/html") # webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]]) # Note: bring up the hocrjs menu and select "show background image"

Ground truth

Paragraphs of the first issue have been annotated according to the article to which they belong.

# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]

OlivierBinette/TessTools documentation built on March 13, 2024, 7:33 p.m.
