crm_extract: Extract text from a single PDF document
In ropenscilabs/crminer: Fetch 'Scholarly' Full Text from 'Crossref'

Description

Extract the text and metadata of a single PDF document, supplied either as a file path or as raw bytes.

Usage

crm_extract(path = NULL, raw = NULL, try_ocr = FALSE, ...)

Arguments

`path`	(character) path to a PDF file; the file must exist
`raw`	(raw) raw bytes of a PDF document
`try_ocr`	(logical) whether to extract text from OCRed pages with `pdftools::pdf_ocr_text()`. Default: `FALSE`, in which case `pdftools::pdf_text()` is used
`...`	arguments passed on to both `pdftools::pdf_info()` and `pdftools::pdf_text()` (or `pdftools::pdf_ocr_text()` if `try_ocr = TRUE`). The same arguments go to both calls, so supply only arguments that both functions accept
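
As a rough illustration of `try_ocr` and `...`, the sketch below assumes that crminer and pdftools are installed, that OCR is only attempted when the tesseract package is available, and that `upw` (the user-password parameter shared by the pdftools functions named above) is forwarded unchanged. It is a usage sketch under those assumptions, not a statement of how the package is implemented.

```r
# Assumes crminer (and its dependency pdftools) is installed
library(crminer)

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
  package = "crminer")

# Default route: embedded text is read with pdftools::pdf_text()
res <- crm_extract(path)

# `upw` is a parameter of pdf_info(), pdf_text() and pdf_ocr_text(),
# so it can be forwarded through `...`; "" is its default (no password)
res_pw <- crm_extract(path, upw = "")

# OCR route: pdftools::pdf_ocr_text() relies on the tesseract package,
# so only attempt it when tesseract is available (OCR can be slow)
if (requireNamespace("tesseract", quietly = TRUE)) {
  res_ocr <- crm_extract(path, try_ocr = TRUE)
}
```

An argument that only one of the two underlying functions understands would be rejected by the other, which is why a shared parameter such as `upw` is used here.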

Details

Text extraction is done with the pdftools package.

Supply either `path` or `raw`, not both.
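
The two input modes can be sketched as follows. This assumes crminer is installed and uses base R's `readBin()` to obtain the raw bytes; how the function reports a call with both arguments is not documented here, so the sketch only checks that such a call fails.

```r
# Assumes crminer (and its dependency pdftools) is installed
library(crminer)

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
  package = "crminer")

# Read the whole file into a raw vector
bytes <- readBin(path, what = "raw", n = file.size(path))

# Either input mode on its own
from_path <- crm_extract(path = path)
from_raw <- crm_extract(raw = bytes)

# Supplying both is expected to fail; try() captures the error
# so the script keeps running
both <- try(crm_extract(path = path, raw = bytes), silent = TRUE)
inherits(both, "try-error")
```

The `raw` route is useful when the PDF is already in memory, for example after downloading it, and writing it to disk first would be wasted effort.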

Value

An object of class `crm_pdf`: a list with the elements `info` (essentially the PDF metadata) and `text` (the extracted text), plus an attribute `path` holding the path to the PDF on disk
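
A short sketch of inspecting the returned object, assuming crminer is installed. It also assumes that `info` and `text` are the unmodified results of `pdftools::pdf_info()` and `pdftools::pdf_text()`, in which case `info$pages` holds the page count and `text` has one string per page.

```r
# Assumes crminer (and its dependency pdftools) is installed
library(crminer)

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
  package = "crminer")
res <- crm_extract(path)

class(res)          # class of the returned object
names(res)          # elements of the returned list
attr(res, "path")   # path attribute described above

# Assumption: info and text come straight from pdftools, so these
# two values should describe the same number of pages
res$info$pages
length(res$text)

# Text of the first page only
cat(res$text[1])
```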

Examples

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
res$text
# with newlines, pretty print
cat(res$text)

# another example
path <- system.file("examples", "ChamberlainEtal2013Ecosphere.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
cat(res$text)

# with raw pdf bytes
path <- system.file("examples", "raw-example.rds", package = "crminer")
rds <- readRDS(path)
class(rds)
crm_extract(raw = rds)

ropenscilabs/crminer documentation built on May 18, 2022, 7:36 p.m.