readPDF: Read In a PDF Document
In tm: Text Mining Package

Return a function which reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.

readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

`engine`	a character string for the preferred PDF extraction engine (see Details).
`control`	a list of control options for the engine with the named components `info` and `text` (see Details).

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and that function can still access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
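
The generator pattern described above can be sketched in base R. The name make_reader and the list it returns are hypothetical illustrations, not part of tm; the sketch only shows how the returned function keeps access to engine through lexical scoping.

```r
# Hypothetical generator mirroring the shape of readPDF(); not part of tm.
make_reader <- function(engine = c("pdftools", "xpdf")) {
  engine <- match.arg(engine)  # resolved once, when the reader is created
  # The returned function has a fixed signature but still sees `engine`,
  # because it is defined inside make_reader()'s environment.
  function(elem, language, id) {
    list(engine = engine, uri = elem$uri, language = language)
  }
}

reader <- make_reader("xpdf")
res <- reader(elem = list(uri = "example.pdf"), language = "en", id = "id1")
res$engine
```

The caller never passes engine to reader; the value is captured when the generator runs.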

Available PDF extraction engines are as follows.

"pdftools": (default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf": command-line pdfinfo and pdftotext executables, which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler": Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript": Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.
"Rcampdf": Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom": custom user-provided extraction engine.

Control parameters for engine "xpdf" are as follows.

info: a character vector specifying options passed to the pdfinfo executable.
text: a character vector specifying options passed to the pdftotext executable.
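
As a sketch of these control options: the -layout flag of pdftotext asks it to preserve the physical layout of the page. The example assumes the pdfinfo and pdftotext executables are installed and on the search path, and it only creates the reader.

```r
library(tm)

# Ask pdftotext to keep the page layout; `info` stays NULL,
# so no extra options are handed to pdfinfo.
reader <- readPDF(engine = "xpdf",
                  control = list(info = NULL, text = "-layout"))

# The reader can then be passed to a corpus constructor, e.g.
# VCorpus(URISource(uri, mode = ""), readerControl = list(reader = reader))
```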

Control parameters for engine "custom" are as follows.

info: a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text: a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
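
A custom engine that satisfies this contract might look as follows. This is a minimal sketch, assuming the pdftools package is installed; the names custom_info, custom_text, and key are hypothetical, and missing metadata keys are replaced by empty strings as an illustrative choice.

```r
library(tm)

# Metadata extractor: takes a file path, returns the named list tm expects.
custom_info <- function(path) {
  i <- pdftools::pdf_info(path)
  # Fall back to an empty string when a key is absent from the PDF.
  key <- function(k) if (is.null(i$keys[[k]])) "" else i$keys[[k]]
  list(Author       = key("Author"),
       CreationDate = as.POSIXlt(i$created),
       Subject      = key("Subject"),
       Title        = key("Title"),
       Creator      = key("Creator"))
}

# Content extractor: takes a file path, returns a character vector
# (pdftools::pdf_text yields one element per page).
custom_text <- function(path) pdftools::pdf_text(path)

reader <- readPDF(engine = "custom",
                  control = list(info = custom_info, text = custom_text))
```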

A function with the following formals:

elem: a named list with the component uri which must hold a valid file name.
language: a string giving the language.
id: Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

Reader for basic information on the reader infrastructure employed by package tm.

uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools" 
} else {
    "ghostscript"
}
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
cat(content(pdf)[1])
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))

Loading required package: NLP
                                 Introduction to the tm Package
                                             Text Mining in R
                                                  Ingo Feinerer
                                               December 21, 2018
Introduction
This vignette gives a short introduction to text mining in R utilizing the text mining framework provided by
the tm package. We present methods for data import, corpus handling, preprocessing, metadata management,
and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining
in R - an in-depth description of the text mining infrastructure offered by tm was published in the Journal of
Statistical Software (Feinerer et al., 2008). An introductory article on text mining in R was published in R
News (Feinerer, 2008).
Data Import
The main structure for managing documents in tm is a so-called Corpus, representing a collection of text
documents. A corpus is an abstract concept, and there can exist several implementations in parallel. The
default implementation is the so-called VCorpus (short for Volatile Corpus) which realizes a semantics as known
from most R objects: corpora are R objects held fully in memory. We denote this as volatile since once the
R object is destroyed, the whole corpus is gone. Such a volatile corpus can be created via the constructor
VCorpus(x, readerControl). Another implementation is the PCorpus which implements a Permanent Corpus
semantics, i.e., the documents are physically stored outside of R (e.g., in a database), corresponding R objects
are basically only pointers to external structures, and changes to the underlying corpus are reflected to all R
objects associated with it. Compared to the volatile corpus the corpus encapsulated by a permanent corpus
object is not destroyed if the corresponding R object is released.
    Within the corpus constructor, x must be a Source object which abstracts the input location. tm provides a
set of predefined sources, e.g., DirSource, VectorSource, or DataframeSource, which handle a directory, a vector
interpreting each component as document, or data frame like structures (like CSV files), respectively. Except
DirSource, which is designed solely for directories on a file system, and VectorSource, which only accepts (char-
acter) vectors, most other implemented sources can take connections as input (a character string is interpreted
as file path). getSources() lists available sources, and users can create their own sources.
    The second argument readerControl of the corpus constructor has to be a list with the named components
reader and language. The first component reader constructs a text document from elements delivered by
a source. The tm package ships with several readers (e.g., readPlain(), readPDF(), readDOC(), . . . ). See
getReaders() for an up-to-date list of available readers. Each source has a default reader which can be
overridden. E.g., for DirSource the default just reads in the input files and interprets their content as text.
Finally, the second component language sets the texts’ language (preferably using ISO 639-2 codes).
    In case of a permanent corpus, a third argument dbControl has to be a list with the named components
dbName giving the filename holding the sourced out objects (i.e., the database), and dbType holding a valid
database type as supported by package filehash. Activated database support reduces the memory demand,
however, access gets slower since each operation is limited by the hard disk’s read and write capabilities.
    So e.g., plain text files in the directory txt containing Latin (lat) texts by the Roman poet Ovid can be
read in with following code:
> txt <- system.file("texts", "txt", package = "tm")
> (ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
+                        readerControl = list(language = "lat")))
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5
                                                         1
GPL Ghostscript 9.26: Unrecoverable error, exit code 1
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
Warning message:
running command ''/usr/bin/gs' -q -dNODISPLAY -P- -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps /work/tmp/tmp/RtmpK1hewM/pdf7ae14a26185e -c quit' had status 1
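
The last call in the example hard-codes the "ghostscript" engine, which exited with an error in the run shown above. A minimal sketch of the same corpus construction with the "pdftools" engine, assuming the pdftools package is installed and the tm vignette PDF is present in the installed package:

```r
library(tm)

uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))

# Same corpus construction as in the example, but with the default
# "pdftools" engine instead of "ghostscript".
corpus <- VCorpus(URISource(uri, mode = ""),
                  readerControl = list(reader = readPDF(engine = "pdftools")))
corpus
```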