Interface to the Tesseract OCR command-line tool (version 4) and parsing functions for the analysis of historical newspaper archives. The package is still under development.
Make sure the tesseract command-line program is installed and available in your PATH. You can either install Tesseract from a pre-built binary package or build it from source.
$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
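The PATH requirement can also be checked from within R before running any OCR. The following is a minimal sketch using only base R; `tesseract_available()` is a hypothetical helper written for this example and is not part of TessTools.

```r
# Hypothetical helper (not part of TessTools): returns TRUE if the given
# command can be found on the PATH of the current R session.
tesseract_available <- function(cmd = "tesseract") {
  unname(nzchar(Sys.which(cmd)))
}

if (tesseract_available()) {
  # Print the installed Tesseract version to the console
  system2("tesseract", "--version")
} else {
  message("tesseract was not found on PATH")
}
```

If the check fails even though Tesseract is installed, the PATH seen by R may differ from the one in your terminal; `Sys.getenv("PATH")` shows what R sees.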
You can install the development version of TessTools from GitHub with:
# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")
Download the first issue (1905) of the Duke Chronicle newspaper.
library(TessTools)
issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")
Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.
hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")
# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page
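For orientation, bounding boxes in hOCR files are stored in each element's `title` attribute in the form `bbox x0 y0 x1 y1` (upper-left and lower-right corners, in pixels). The sketch below shows how such a string could be parsed with base R. `parse_bbox()` is a hypothetical helper for illustration only, the sample strings are made up, and this is not necessarily how `paragraphs()` works internally.

```r
# Hypothetical helper (not part of TessTools): parse bounding boxes from
# hOCR `title` attribute strings such as "bbox 36 92 580 122".
# Returns an integer matrix with one row per input string; rows are NA
# when no bounding box is found.
parse_bbox <- function(title) {
  pattern <- "bbox ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)"
  matches <- regmatches(title, regexec(pattern, title))
  # Elements 2 to 5 of each match are the four captured coordinates
  coords <- t(vapply(matches, function(x) as.integer(x[2:5]), integer(4)))
  colnames(coords) <- c("x0", "y0", "x1", "y1")
  coords
}

# Made-up example strings in the hOCR title format
titles <- c("bbox 36 92 580 122", "bbox 36 140 580 310; x_wconf 93")
boxes <- parse_bbox(titles)
print(boxes)

# Box width and height follow directly from the corner coordinates
widths  <- boxes[, "x1"] - boxes[, "x0"]
heights <- boxes[, "y1"] - boxes[, "y0"]
```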
Visualize the result using hocrjs:
webpages = visualize_html(hocrfiles, outputdir="data-raw/html")
# webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]])
# Note: bring up the hocrjs menu and select "show background image"
Paragraphs of the first issue have been annotated according to the article to which they belong.
# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]