Getting Started with pdfsearch"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This package defines a few useful functions for keyword searching using the pdftools package developed by rOpenSci.

Basic Usage

There are currently two user-facing functions in this package. The first, keyword_search, takes a single pdf and searches it for keywords. The second, keyword_directory, performs the same search over a directory of pdfs.

keyword_search Example

The package comes with two pdf files from arXiv to use as test cases. Below is an example of using the keyword_search function.

library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('measurement', 'error'),
            path = TRUE)
head(result)
head(result$line_text, n = 2)

The location of the keyword match, including page number and line number, and the actual line of text are returned by default.
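For a quick look at everything that is returned, you can inspect the column names of the result directly:

names(result)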

Surrounding lines of text

It may be useful to extract not just the line of text that the keyword is found in, but also surrounding text to have additional context when looking at the keyword results. This can be added by using the argument surround_lines as follows:

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('measurement', 'error'),
            path = TRUE, surround_lines = 1)
head(result)
head(result$line_text, n = 2)

Combine hyphenated words

Typeset PDF files commonly contain words that wrap from one line to the next and are hyphenated. An example of this is shown in the following image.

(Image: example of a word hyphenated across two lines)

Hyphenated words are treated as two separate words, so a keyword that would match the unhyphenated word is missed whenever that word happens to be split across lines. Fortunately, the keyword_search function has a remove_hyphen argument that removes the hyphen at the end of a line and rejoins the word fragment with the start of the next line. Below is an example showing the results before and after using the remove_hyphen argument. By default this argument is set to TRUE.

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result_hyphen <- keyword_search(file, 
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = FALSE)

result_remove_hyphen <- keyword_search(file, 
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = TRUE)

nrow(result_hyphen)
nrow(result_remove_hyphen)

You'll notice that removing the hyphens added a few additional keyword matches to the results. These were cases where the word "measurement" wrapped across two lines and was hyphenated (see the image above for an example).
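One way to see exactly which matches were added is to compare the two result sets by page and line number. This is only a rough sketch; it assumes the output columns are named page_num and line_num:

added <- !paste(result_remove_hyphen$page_num, result_remove_hyphen$line_num) %in%
  paste(result_hyphen$page_num, result_hyphen$line_num)
result_remove_hyphen[added, 'line_text']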

One specific note about removing hyphens in multiple column PDF files: this feature is still experimental and often does not work well with multi-column layouts. Use the remove_hyphen argument with caution with these files.

Split document into words

Using the tokenizers R package, it is also possible to split the document into individual words. This is most useful when the goal is a text analysis rather than a keyword search. Below is an example showing the first page of the text converted to words. By default, hyphenation at the ends of lines is removed and the word fragments are rejoined (see the previous section for a description of this).

token_result <- convert_tokens(file, path = TRUE)[[1]]
head(token_result)

Another use of the convert_tokens function is to convert the text of the search results to tokens. This can be useful in tandem with the surround_lines argument when the results are intended as input to a text analysis. These tokens are included by default when calling the keyword_search function.

file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('repeated measures', 'mixed effects'),
            path = TRUE, surround_lines = 1)
result
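Because token results are included by default, you can also pull them directly from the search output. A small sketch, assuming the tokens are stored in a list column named token_text:

head(result$token_text[[1]])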

keyword_directory Example

The keyword_directory function is useful when you have a directory of many pdf files in which you want to search for a series of keywords with a single function call. This can be particularly useful in the context of a research synthesis or to screen studies for characteristics to include in a meta-analysis.

The two arXiv pdf files that come with the package live in a single directory, which will be used as the example use case here.

directory <- system.file('pdf', package = 'pdfsearch')

result <- keyword_directory(directory,
                           keyword = c('repeated measures', 'mixed effects',
                                       'error'),
                           surround_lines = 1, full_names = TRUE)
head(result)

The full_names argument is needed here to specify that the full file path should be used to access the pdf files. If the search is run from the directory containing the files (e.g. when using an R project in RStudio), then full_names could be set to FALSE.
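As a rough sketch of that second case, assuming the pdf files sit in your current working directory (the setwd() call is purely illustrative and commented out):

# setwd('path/to/folder/with/pdfs')
result_local <- keyword_directory(getwd(), keyword = 'error',
                                  full_names = FALSE)
head(result_local)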

Limitations

Currently there are a handful of limitations, mostly around how pdfs are read into R by the pdftools R package. When a pdf is created in a multiple column layout, a line in the pdf consists of the entire line across both columns. This can lead to fragmented text that may not give the full contents, even when using the surround_lines argument.

Another limitation concerns keyword searches for multi-word phrases. If the phrase falls on a single line, the match is returned. However, if the phrase spans multiple lines, the current implementation will not return it as a match.

Shiny App

The package also has a simple Shiny app that can be launched with the following command:

run_shiny()

