crm_extract: Extract text from a single pdf document

crm_extract    R Documentation

Description

Extract text from a single pdf document

Usage

crm_extract(path = NULL, raw = NULL, try_ocr = FALSE, ...)

Arguments

path

(character) path to a file, file must exist

raw

(raw) raw bytes

try_ocr

(logical) whether to try extracting OCRed pages with pdftools::pdf_ocr_text(). Default: FALSE. If FALSE, pdftools::pdf_text() is used.

...

arguments passed on to pdftools::pdf_info() and pdftools::pdf_text() (or pdftools::pdf_ocr_text() if try_ocr = TRUE). The same arguments are passed to both function calls, so supply only arguments that both functions accept.
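The try_ocr argument can be used as a fallback when plain extraction yields nothing, which typically happens with scanned PDFs. The following is a minimal sketch, not part of crminer: needs_ocr() is a hypothetical helper, and it assumes that res$text is a character vector that is empty or whitespace-only when no text layer was found.

```r
# Hypothetical helper (not part of crminer): TRUE when the extracted text
# contains nothing but empty or whitespace-only strings.
needs_ocr <- function(text) {
  all(!nzchar(trimws(text)))
}

# Guarded usage: runs only if crminer is installed.
if (requireNamespace("crminer", quietly = TRUE)) {
  path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
    package = "crminer")
  res <- crminer::crm_extract(path)
  if (needs_ocr(res$text)) {
    # pdftools::pdf_ocr_text() relies on the tesseract package being installed
    res <- crminer::crm_extract(path, try_ocr = TRUE)
  }
}
```

OCR is much slower than plain text extraction, so trying pdftools::pdf_text() first and only then falling back is usually the cheaper order.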

Details

We use pdftools under the hood to do pdf text extraction.

You must supply exactly one of path or raw, not both.
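To illustrate the two input modes, here is a minimal sketch that reads a PDF from disk into raw bytes with base R and passes those bytes via raw instead of path. read_pdf_bytes() is a hypothetical helper, not part of crminer.

```r
# Hypothetical helper (not part of crminer): read a whole file as raw bytes.
read_pdf_bytes <- function(path) {
  stopifnot(file.exists(path))
  readBin(path, what = "raw", n = file.size(path))
}

# Guarded usage: runs only if crminer is installed.
if (requireNamespace("crminer", quietly = TRUE)) {
  pdf_path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
    package = "crminer")
  bytes <- read_pdf_bytes(pdf_path)
  # Supply raw only; path is left as NULL
  res <- crminer::crm_extract(raw = bytes)
}
```

The raw route is useful when the PDF is already in memory, for example as the body of an HTTP response, and writing it to disk first would be unnecessary.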

Value

An object of class crm_pdf with a slot for info (essentially the PDF metadata) and a slot for text (the extracted text). The object also carries an attribute, path, holding the path to the PDF on disk.
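The parts of the returned object described above can be inspected with base R. The sketch below uses a hypothetical helper, summarise_crm_pdf(), which is not part of crminer and assumes only the structure stated here: a text slot and a path attribute.

```r
# Hypothetical helper (not part of crminer): collect the class, the path
# attribute, and simple size measures of the extracted text.
summarise_crm_pdf <- function(x) {
  list(
    class = class(x),
    path = attr(x, "path"),
    n_text = length(x$text),
    n_chars = sum(nchar(x$text))
  )
}

# Guarded usage: runs only if crminer is installed.
if (requireNamespace("crminer", quietly = TRUE)) {
  path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
    package = "crminer")
  res <- crminer::crm_extract(path)
  str(summarise_crm_pdf(res))
}
```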

Examples

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
res$text
# with newlines, pretty print
cat(res$text)

# another example
path <- system.file("examples", "ChamberlainEtal2013Ecosphere.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
cat(res$text)

# with raw pdf bytes
path <- system.file("examples", "raw-example.rds", package = "crminer")
rds <- readRDS(path)
class(rds)
crm_extract(raw = rds)

ropensci/crminer documentation built on May 18, 2022, 9:50 a.m.