read_pdf: Read a Portable Document Format into R
In textreadr: Read Text Documents into R


A wrapper for pdftools::pdf_text() to read PDFs into R.

read_pdf(file, skip = 0, remove.empty = TRUE, trim = TRUE, ocr = TRUE, ...)

`file`	A path to a PDF file.
`skip`	Integer; the number of lines of the data file to skip before beginning to read data.
`remove.empty`	Logical. If `TRUE`, empty elements in the vector are removed.
`trim`	Logical. If `TRUE`, leading/trailing white space is removed.
`ocr`	Logical. If `TRUE`, documents for which pdftools::pdf_text() returns no text will be re-run using OCR via the `tesseract::ocr()` function. This creates temporary .png files and requires much more compute time.
`...`	Other arguments passed to pdftools::pdf_text().
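
The effect of `skip`, `trim`, and `remove.empty` can be approximated on a plain character vector with base R. This is only a minimal sketch, not the package's actual implementation: `clean_lines` is a hypothetical helper, and the order in which the steps are applied is an assumption.

```r
# Hypothetical helper; mimics the documented options on a character vector.
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {
    if (skip > 0) x <- utils::tail(x, -skip)   # drop the first `skip` lines
    if (trim) x <- trimws(x)                   # strip leading/trailing white space
    if (remove.empty) x <- x[nzchar(x)]        # drop empty elements
    x
}

# Made-up input lines, for illustration only.
lines <- c("  Title  ", "", "  first line", "second line  ")
cleaned <- clean_lines(lines, skip = 1)
print(cleaned)
```

In this sketch, a line holding only white space is dropped only when `trim` is also `TRUE`, because trimming happens before the empty check.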

Returns a base::data.frame() with the page number (page_id), line number (element_id), and the text.
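
To illustrate the shape of the return value, the sketch below builds a small data frame with the three documented columns and collapses the lines of each page into one string. The data are made up rather than read from a PDF, and whether `element_id` restarts on each page is an assumption here.

```r
# Made-up data with the documented columns; not output from a real PDF.
pdf_like <- data.frame(
    page_id    = c(1L, 1L, 2L),
    element_id = c(1L, 2L, 1L),
    text       = c("First line", "Second line", "Line on page two"),
    stringsAsFactors = FALSE
)

# Collapse the lines of each page into a single string per page.
pages <- tapply(pdf_like$text, pdf_like$page_id, paste, collapse = " ")
print(pages)
```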

A word of caution from Carl Witthoft: "Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you." If the reader has OCR needs, the tesseract package, available on CRAN (https://CRAN.R-project.org/package=tesseract), is an "OCR engine with Unicode (UTF-8) support" and may be of use.
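
One way to decide whether a document needs OCR is to check whether plain text extraction produced anything but white space. The sketch below assumes the extracted text arrives as a character vector with one element per page; `needs_ocr` is a hypothetical helper and is not claimed to be the check the package performs.

```r
# Hypothetical helper: TRUE when no page yields any non-white-space text.
needs_ocr <- function(pages) {
    all(!nzchar(trimws(pages)))
}

# Made-up page contents, for illustration only.
image_only <- c("", "  \n ")          # extraction found no text
mixed      <- c("Some text", "")      # at least one page has text

print(needs_ocr(image_only))
print(needs_ocr(mixed))
```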

pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)

pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)

## Not run: 
library(textshape)
system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
    read_pdf(skip = 1) %>%
    `[[`('text') %>%
    head(-1) %>%
    textshape::combine() %>%
    gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%
    strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%
    `[[`(1) %>%
    textshape::split_transcript()

## End(Not run)

## Not run: 
## An image-based .pdf file returns nothing.  Using the tesseract package as
## a backend for OCR overcomes this problem.

## Without OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = FALSE
)

## With OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = TRUE
)

## End(Not run)