SpilloverDA includes functions for extracting keywords from a PDF document.
In addition to the R package requirements, fully using these tools requires additional software on your system:
pdftohtml, preferably our extended version available on GitHub. Please see the installation instructions on that page.
Our fork of the Epitator tool for keyword extraction, found here.
The Epitator dictionary will need to be created. Instructions here.
If you would like to extract information from scanned documents, i.e., PDFs that contain only an image with no text that can be selected and copied, you will need additional packages/software to perform OCR (Optical Character Recognition) before extracting keywords:
Rtesseract and tesseract: Rtesseract is available from GitHub, and requires that you also install Tesseract and its language files. Installation instructions are available on each of those pages.
ImageMagick convert, or a similar utility, to convert the PDF file to an image file (e.g., PNG, JPG).
The first step is converting the PDF document to an XML file using pdftohtml. By default, SpilloverDA uses the convenience function ReadPDF::convertPDF2XML() to do the conversion.
The XML is analyzed by ReadPDF::isScanned() to determine whether OCR will be needed.
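The conversion-and-dispatch steps above might be sketched as follows; the ReadPDF function names come from the text, but the wrapper itself (and its argument) is hypothetical:

```r
# Sketch of the pipeline: convert to XML, then branch on whether the
# document is a scan. processPDF() is a hypothetical wrapper, not a
# SpilloverDA function.
processPDF <- function(pdf_path) {
  xml_path <- ReadPDF::convertPDF2XML(pdf_path)  # step 1: PDF -> XML
  if (ReadPDF::isScanned(xml_path)) {
    ocrPDF(pdf_path)                             # scanned: OCR branch
  } else {
    ReadPDF::readXMLSections(xml_path)           # text layer present
  }
}
```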
If OCR is NOT needed: ReadPDF::readXMLSections()
If OCR IS needed: ocrPDF()
a. The original PDF is converted to image files using a conversion utility, e.g., convert
b. The image files are OCR-ed by Rtesseract::tesseract()
c. The text is reconstructed into sections. NOTE: Currently, the identification of sections from the OCR results is not fully implemented.
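For step (a), the ImageMagick command line can be assembled in R. The helper below is illustrative only (the helper name and the flag choices are not taken from ocrPDF()):

```r
# Build an ImageMagick 'convert' command that renders one PNG per page.
# 300 dpi is a conventional resolution for OCR input; pdfToImagesCmd()
# is a hypothetical helper.
pdfToImagesCmd <- function(pdf, out_prefix) {
  sprintf("convert -density 300 %s %s-%%d.png", pdf, out_prefix)
}

pdfToImagesCmd("paper.pdf", "page")
```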
doc2keyword
a. The sectioned text is written to a temporary file.
b. That file is used as input to Epitator, which extracts the keywords and writes the results to a second temporary file as JSON.
c. The JSON is read back into R using RJSONIO::fromJSON()
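fromJSON() returns a nested list. A toy illustration of flattening such a list into a data.frame, using an assumed shape rather than Epitator's actual schema:

```r
# Assumed shape: one sub-list per keyword hit (NOT Epitator's real schema).
hits <- list(
  list(term = "Nipah virus", section = "abstract"),
  list(term = "Nipah virus", section = "results"),
  list(term = "Pteropus",    section = "methods")
)

# Turn each hit into a one-row data.frame, then stack them.
df <- do.call(rbind, lapply(hits, as.data.frame, stringsAsFactors = FALSE))
```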
NEEDS WORK
Since the results from JSON are a nested list, convenience functions are used to extract the results into a data.frame.
These different data types are kept separate for now, since each data type needs a separate model to assign the score. A typical use of these would be:
funs = c(getLocation, getSpecies, getGoldStandTest, getVirus2, getSpeciesAbb, getDate)
vars = c("location", "species", "diagtest", "virus", "sp_abb", "date")
test.vars = lapply(funs, function(fn) { try(fn(pdf_results)) })
This results in a list, with each data type being an element in the list. Within each element, there will be a data.frame of extracted results.
Next, these need to be further collapsed using mkTestSet(), which collects multiple mentions into a single row per term, with the relevant metrics (number of mentions, sections, etc.) added to the data.frame.
NEEDS WORK
Terms from the JSON are collected into a data.frame, along with relevant context information (number of mentions, sections where the term occurred, etc.)
The terms are scored using a trained classifier. Currently, the classifier is a random-forest model trained on hand-entered data using the partykit package.
The scored results are returned as a data.frame.
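The scoring step can be pictured with a toy model. Here a base-R logistic regression stands in for the partykit random forest so the example is self-contained; the features, labels, and data are all invented:

```r
# Invented hand-labelled training set: keep = 1 means the term is a true hit.
train <- data.frame(
  n_mentions  = c(1, 5, 2, 8, 1, 6, 3, 7),
  in_abstract = c(0, 1, 0, 1, 0, 1, 0, 1),
  keep        = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# glm() is a stand-in; the package's real classifier is a partykit forest.
fit <- glm(keep ~ n_mentions + in_abstract, data = train, family = binomial)

# Attach a score column to newly extracted terms.
new_terms <- data.frame(n_mentions = c(7, 1), in_abstract = c(1, 0))
new_terms$score <- predict(fit, new_terms, type = "response")
```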