This vignette walks the user through applying the neural embedding NLP approach to a novel set of PDF documents. We use a sample corpus of eight peer-reviewed academic journal articles about restoration.
R can be downloaded from this link. Once it is installed, open the 32-bit version (i386), since WRI computers appear to have only the 32-bit version of Java. Then install the package by running the following lines of code. Copy and paste them one at a time, pressing Enter after each.
install.packages("devtools")
library(devtools)
install_github("wri/retrieveR")
Next, we load the package into R using library. Depending on your operating system, you then need to run either install_mac or install_windows. These functions fetch the Java dependencies needed to extract text from images and install the components needed to run neural networks. Finally, the download_example function downloads the example PDFs.
library(retrieveR)
# Run only the installer that matches your operating system
install_mac()      # macOS
install_windows()  # Windows
download_example()
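Since only one of the two installers applies to a given machine, the choice can be made from the operating system name that R reports. The following is a minimal sketch using base R only; it assumes that install_mac and install_windows are exported by retrieveR and take no required arguments, as in the calls above, and it does nothing beyond printing a message on other systems.

```r
# Detect the operating system: "Darwin" on macOS, "Windows" on Windows.
os <- Sys.info()[["sysname"]]

# Map the OS name to the matching installer named in this vignette.
installer <- switch(os,
  Darwin  = "install_mac",
  Windows = "install_windows",
  NA_character_  # any other system: no installer is described here
)

if (is.na(installer)) {
  message("No installer is described for this system: ", os)
} else if (requireNamespace("retrieveR", quietly = TRUE)) {
  # Look the installer up by name in retrieveR and call it
  # (assumes it is exported and needs no arguments).
  getExportedValue("retrieveR", installer)()
} else {
  message("Install retrieveR first, then run ", installer, "()")
}
```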
The prep_documents function strips text from the PDFs, cleans up the results, and calculates neural weights. Each of these steps can be turned off by specifying ocr = FALSE, clean = FALSE, or weights = FALSE. The function takes a path to the folder of documents; in this case they are stored in a folder called pdfs. The path is relative to the directory that R is running in, which can be printed with getwd() and changed with setwd().
corpus <- prep_documents("pdfs")
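Because the folder path is resolved against the working directory, it can help to confirm where R is running and that the folder actually contains PDFs before calling prep_documents. This is a small sketch using base R only; the folder name pdfs is the one used above, and the variable names doc_dir and pdf_files are illustrative.

```r
# Show the directory R is currently running in.
getwd()

# Build the full path to the document folder inside the working directory.
doc_dir <- file.path(getwd(), "pdfs")

# Check that the folder exists, then count the PDF files inside it.
if (dir.exists(doc_dir)) {
  pdf_files <- list.files(doc_dir, pattern = "\\.pdf$", ignore.case = TRUE)
  message(length(pdf_files), " PDF file(s) found in ", doc_dir)
} else {
  message("No folder found at ", doc_dir, "; use setwd() to change directory")
}
```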
The create_report function takes several arguments, including query, the search terms given as a single string, and data, the object that the output of prep_documents is stored to.
create_report(query = "food water waste wastewater reuse", data = corpus)
create_report(query = "land tenure", data = corpus, interactive = FALSE, thresh = 0.51)
The results of the create_report function are stored in an HTML file in the working directory. I have included the results within this file for ease of example.
htmltools::includeHTML("land_tenure.html")
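Outside of a vignette, the same report can be opened directly in a web browser. A minimal sketch with base R, assuming the report was written to the working directory under the file name land_tenure.html used above; the variable name report is illustrative.

```r
# Build the path to the report inside the working directory.
report <- file.path(getwd(), "land_tenure.html")

# Open the report in the default browser if it exists.
if (file.exists(report)) {
  utils::browseURL(report)
} else {
  message("No report found at ", report)
}
```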