In dsidavis/SpilloverDA: Munging and Exploring the Viral Spillover Data Set

SpilloverDA includes functions which allow you to extract keywords from a PDF document.

Setup

In addition to the R package requirements, to fully use these tools you will need additional software on your system:

pdftohtml, preferrably our extended version available on github. Please see installation instructions on that page.
Our fork of the Epitator tool for keyword extraction, found here.
The Epitator dictionary will need to be created. Instructions here.

Additional setup: OCR

If you would like to extract information from documents which are scanned, i.e., are only an image without text which can be selected and copied out of the pdf, you will need additional packages/software to do OCR (Optical Character Recognition) prior to extracting keywords:

Rtesseract and tesseract: Rtesseract is available from github, and requires that you also install Tesseract and language files. Installation instructions are available on each of those pages.
Imagemagik convert or similar utility to convert the PDF file to an image file, e.g. PNG, JPG, etc.

Overview of process

Extraction of text

The first step is converting the PDF document to an XML file using pdftohtml. SpilloverDA uses a convienence function ReadPDF::convertPDF2XML() to do the conversion by default.
The XML is analyzed by ReadPDF::isScanned() to determine if OCR will be needed.

If OCR is NOT needed:

The text is extracted from the XML document and is separated into sections by ReadPDF::readXMLSections()

If OCR IS needed:

The PDF is OCRed to text by ocrPDF()

a. The original PDF is converted to image files using a conversion utility, e.g. convert

b. The image files are OCR-ed by Rtesseract::tessreact()

c. The text is reconstructed into sections NOTE: Currently, the identification of sections from the OCR results is not fully implimented.

Extraction of keywords

The extraction of keywords is handled by doc2keyword

a. The sectioned text is written to a tmp file

b. The tmp file is used as an input to Epitator, which extracts the keywords and writes the results to a second tmp file as JSON.

c. The JSON is read back into R using RSJONIO::fromJSON()

Extracting data from JSON

NEEDS WORK

Since the results from JSON are a nested list, convienence functions are used to extract the results into a data.frame;

getLocation
getSpecies
getGoldStandardTest
getVirus2
getSpeciesAbb
getDate

These different data types are kept separate for now, since each data type needs a separate model to assign the score. A typical use of these would be:

funs = c(getLocation, getSpecies, getGoldStandTest, getVirus2,
         getSpeciesAbb, getDate)

vars = c("location", "species", "diagtest", "virus", "sp_abb", "date")

test.vars = lapply(funs, function(fn) {
    try(fn(pdf_results))
})

This results in a list, with each data type being an element in the list. Within each element, there will be a data.frame of extracted results.

Next, these need to be further collapsed using mkTestSet() which collects multiple mentions into a single row per term, with the relevant metrics (number of mentions, sections, etc) added to the data.frame.

Scoring of results

NEEDS WORK

Terms from the JSON are collected into a data.frame, along with relevant context information (number of mentions, sections where the term occurred, etc.)
The terms are scored using a trained classifier. Currently, the classifier is a random-forest model trained on hand-entered data using the partykit package.
The scored results are returned as a data.frame