jacob-ogre/pdftext: Extract Text from Text- and Image-based PDFs

A large batch of PDFs may contain a mix of text-based and image-based files, and the text must be extracted from all of them for analysis. This package offers a single primary function that first tries to extract text with pdftools, a wrapper around the poppler library; if that fails, ImageMagick, unpaper, and Tesseract are used to perform optical character recognition (OCR).
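The fallback described above can be sketched roughly as follows. This is an illustrative sketch only, not the package's actual code: the function names `needs_ocr` and `extract_text_sketch` are hypothetical, and it assumes the `pdftools` R package plus the `convert` (ImageMagick), `unpaper`, and `tesseract` command-line tools are installed and on the PATH. Newer ImageMagick releases may expose the command as `magick` instead of `convert`.

```r
# Hypothetical helper: TRUE when no usable text came back from pdftools,
# which suggests the PDF is image-based and needs OCR.
needs_ocr <- function(pages) {
  length(pages) == 0 || all(!nzchar(trimws(pages)))
}

# Hypothetical sketch of the two-stage extraction; not the package's function.
extract_text_sketch <- function(pdf, workdir = tempfile("ocr_")) {
  # Stage 1: try the text layer via pdftools (poppler).
  pages <- tryCatch(pdftools::pdf_text(pdf), error = function(e) character(0))
  if (!needs_ocr(pages)) {
    return(paste(pages, collapse = "\n"))
  }

  # Stage 2: rasterize, clean, and OCR each page.
  dir.create(workdir, showWarnings = FALSE)

  # ImageMagick: one grayscale PGM image per page (assumed settings).
  system2("convert", c("-density", "300", shQuote(pdf),
                       "-colorspace", "Gray",
                       shQuote(file.path(workdir, "page-%03d.pgm"))))
  images <- sort(list.files(workdir, pattern = "^page-.*\\.pgm$",
                            full.names = TRUE))

  text <- vapply(images, function(img) {
    # unpaper: deskew and remove scanning noise.
    cleaned <- sub("\\.pgm$", "-clean.pgm", img)
    system2("unpaper", c(shQuote(img), shQuote(cleaned)))

    # tesseract writes its result to <out_base>.txt.
    out_base <- sub("\\.pgm$", "", cleaned)
    system2("tesseract", c(shQuote(cleaned), shQuote(out_base)))
    paste(readLines(paste0(out_base, ".txt"), warn = FALSE), collapse = "\n")
  }, character(1))

  paste(text, collapse = "\n")
}
```

Treating an all-whitespace result as a failure is one possible design choice; a batch with partially scanned documents might need a per-page check instead.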

Getting started

Package details

License: BSD_2_clause + file LICENSE
Package repository: GitHub (jacob-ogre/pdftext)
Installation: Install the latest version of this package from GitHub within an R session.
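A minimal sketch of that installation step, assuming the `remotes` package is used to install directly from the GitHub repository named above (other installers, such as `devtools`, would work similarly):

```r
# Install the remotes helper from CRAN, then the package from GitHub.
install.packages("remotes")
remotes::install_github("jacob-ogre/pdftext")
```

The OCR fallback also relies on ImageMagick, unpaper, and Tesseract, which are system tools and are not installed by these commands.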
jacob-ogre/pdftext documentation built on May 18, 2019, 8:01 a.m.