heading_search: Function to locate sections of pdf

View source: R/heading_search.r

heading_search    R Documentation

Function to locate sections of pdf

Description

Locates section headings in the text of a pdf so that the text can be separated by section. The function returns the headings along with their locations in the pdf.

Usage

heading_search(
  x,
  headings,
  path = FALSE,
  pdf_toc = FALSE,
  full_line = FALSE,
  ignore_case = FALSE,
  split_pdf = FALSE,
  convert_sentence = FALSE
)

Arguments

x

Either the text of the pdf read in with the pdftools package or a path for the location of the pdf file.

headings

A character vector representing the headings to search for. Can be NULL if pdf_toc = TRUE.

path

TRUE/FALSE indicating whether x is a path to the pdf file to be converted to text. The pdftools package is used for this conversion.

pdf_toc

TRUE/FALSE indicating whether the pdf_toc function from the pdftools package should be used. This is most useful if the pdf has an embedded table of contents. path = TRUE must be specified if pdf_toc = TRUE.

full_line

TRUE/FALSE indicating whether the headings should reside on their own line. Setting this to TRUE can create problems with multi-column pdfs.

ignore_case

TRUE/FALSE or a vector of TRUE/FALSE, indicating whether the case of the headings should be ignored. Default is FALSE, meaning that the headings are matched case-sensitively. If a vector, it must be the same length as the headings vector.

split_pdf

TRUE/FALSE indicating whether to split the pdf using white space. This would be most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text into a single column starting with the left column and proceeding to the right.

convert_sentence

TRUE/FALSE indicating whether individual lines of the pdf file should be collapsed into a single large paragraph to perform keyword searching. Default is FALSE.
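
To make the matching arguments above more concrete, the following base R sketch imitates what full_line, ignore_case, and convert_sentence are described as doing. It is an illustration only, not the package's implementation: the helper find_headings and the sample lines are hypothetical, and treating full_line as a pattern anchored to the whole line is an assumption.

```r
# Hypothetical page text: one character string per line of a pdf page.
lines <- c("Abstract",
           "We study online experiments.",
           "1. INTRODUCTION",
           "See the introduction of",
           "Smith (2010).")
headings <- c("abstract", "introduction")

# Hypothetical helper (not part of pdfsearch): for each heading, return the
# numbers of the lines that match it.
find_headings <- function(lines, headings, ignore_case = FALSE,
                          full_line = FALSE) {
  # A single TRUE/FALSE is recycled; a vector gives one setting per heading.
  ignore_case <- rep_len(ignore_case, length(headings))
  # Assumption: full_line is imitated by anchoring the pattern to the line.
  patterns <- if (full_line) paste0("^", headings, "$") else headings
  hits <- Map(function(p, ic) grep(p, lines, ignore.case = ic),
              patterns, ignore_case)
  names(hits) <- headings
  hits
}

# Case-sensitive search: "abstract" does not match "Abstract".
case_sensitive <- find_headings(lines, headings)

# Case-insensitive search: "introduction" matches lines 3 and 4.
case_insensitive <- find_headings(lines, headings, ignore_case = TRUE)

# One setting per heading: ignore case for the first heading only.
per_heading <- find_headings(lines, headings, ignore_case = c(TRUE, FALSE))

# Headings must occupy the whole line: only "Abstract" on line 1 qualifies.
own_line <- find_headings(lines, headings, ignore_case = TRUE,
                          full_line = TRUE)

# Imitating convert_sentence: collapse the lines into one paragraph so that
# a phrase broken across two lines can still be found.
paragraph <- paste(lines, collapse = " ")
split_phrase_found <- grepl("introduction of Smith", paragraph, fixed = TRUE)
```

The sketch also shows why full_line can help: without it, the sentence on line 4 that merely mentions the word "introduction" is reported alongside the real heading.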

Examples

file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

heading_search(file, headings = c('abstract', 'introduction'),
  path = TRUE)
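
As a further sketch, the same search can be run case-insensitively, and on text that has already been read with pdftools rather than on a file path. This assumes the pdfsearch and pdftools packages are installed; the variable names (searched, res_path, res_text) are illustrative, and the code is skipped when either package is missing.

```r
# Flag recording whether the searches were actually run.
searched <- FALSE

# Run only if both packages are available.
if (requireNamespace("pdfsearch", quietly = TRUE) &&
    requireNamespace("pdftools", quietly = TRUE)) {
  file <- system.file("pdf", "1501.00450.pdf", package = "pdfsearch")

  # Search by file path, ignoring the case of the headings.
  res_path <- pdfsearch::heading_search(
    file,
    headings = c("abstract", "introduction"),
    path = TRUE,
    ignore_case = TRUE
  )

  # Search text that was already read in with pdftools (path stays FALSE).
  txt <- pdftools::pdf_text(file)
  res_text <- pdfsearch::heading_search(
    txt,
    headings = c("abstract", "introduction"),
    ignore_case = TRUE
  )

  print(res_path)
  print(res_text)
  searched <- TRUE
}
```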


lebebr01/pdfsearch documentation built on July 17, 2022, 7:02 a.m.