getPDF: Extract text from PDF files and return a word-occurrence...

View source: R/inpdfr_PRO_extractTxt.R

getPDF    R Documentation

Extract text from PDF files and return a word-occurrence data.frame.

Description

getPDF returns a word-occurrence data.frame from PDF files. It requires XPDF (http://www.foolabs.com/xpdf/download.html) and uses the parallel package for parallel computation.

Usage

getPDF(
  myPDFs,
  minword = 1,
  maxword = 20,
  minFreqWord = 1,
  pathToPdftotext = ""
)

Arguments

myPDFs

A character vector containing PDF file names.

minword

An integer specifying the minimum number of letters per word in the returned data.frame.

maxword

An integer specifying the maximum number of letters per word in the returned data.frame.

minFreqWord

An integer specifying the minimum word frequency in the returned data.frame.

pathToPdftotext

A character string containing an alternative path to the XPDF pdftotext program; see the Details section.
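
To illustrate what the minword, maxword and minFreqWord arguments mean, the following base R sketch applies the same kind of filter to a toy word-frequency table. It does not reproduce getPDF's internals: the helper filterWords, the column names word and freq, and the inclusive bounds are all assumptions made for the example.

```r
# Toy text standing in for the content extracted from a PDF.
txt <- "the cat sat on the mat and the cat ran"

# Split into lowercase words and drop empty tokens.
words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
words <- words[nzchar(words)]

# Count occurrences of each distinct word.
counts <- table(words)
wordFreq <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of inpdfr): keep words whose length lies
# within [minword, maxword] and whose frequency is at least minFreqWord.
# Treating the bounds as inclusive is an assumption.
filterWords <- function(df, minword = 1, maxword = 20, minFreqWord = 1) {
  len <- nchar(df$word)
  df[len >= minword & len <= maxword & df$freq >= minFreqWord, , drop = FALSE]
}

# Keep words of at least 3 letters that occur at least twice.
filtered <- filterWords(wordFreq, minword = 3, maxword = 20, minFreqWord = 2)
print(filtered)
```

With the defaults (minword = 1, maxword = 20, minFreqWord = 1) the helper removes only words longer than 20 letters; raising minword or minFreqWord is a common way to drop short function words and rare terms.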

Details

getPDF uses the XPDF pdftotext program to extract the content of PDF files into a TXT file. If pdftotext is not in the PATH, provide the full path to the program in the pathToPdftotext parameter instead.
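
One possible way to decide what to pass as pathToPdftotext is to check the PATH first with base R's Sys.which. The sketch below assumes that the default "" makes getPDF rely on the PATH, as the description above suggests. The helper resolvePdftotext and the path in the commented usage are hypothetical and not part of inpdfr.

```r
# Hypothetical helper, not part of inpdfr: choose a value for pathToPdftotext.
resolvePdftotext <- function(fallback = "") {
  # Sys.which() returns "" when the program is not found on the PATH.
  onPath <- Sys.which("pdftotext")
  if (nzchar(onPath)) {
    # pdftotext is reachable via the PATH, so the default "" should suffice.
    return("")
  }
  if (nzchar(fallback) && file.exists(fallback)) {
    # Otherwise fall back to the user-supplied full path.
    return(fallback)
  }
  stop("pdftotext not found; install XPDF or supply its full path")
}

# Usage sketch (the path below is a made-up example):
# getPDF(
#   myPDFs = "mypdf.pdf",
#   pathToPdftotext = resolvePdftotext("C:/tools/xpdf/pdftotext.exe")
# )
```

Failing early with stop() when neither location is available surfaces the missing dependency before any PDF is processed.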

Value

A list of lists, each with a word-occurrence data.frame and a file name.

Examples

## Not run: 
getPDF(myPDFs = "mypdf.pdf")

## End(Not run)

inpdfr documentation built on Aug. 24, 2023, 9:09 a.m.