format_text: Format PDF input text

View source: R/format_text.r

format_text    R Documentation

Format PDF input text

Description

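Formats PDF text after import. Depending on the arguments, it can split multicolumn pages into a single column, rejoin hyphenated words, remove equation lines, and collapse the lines of text into a single paragraph for keyword searching.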

Usage

format_text(
  pdf_text,
  split_pdf = FALSE,
  remove_hyphen = TRUE,
  convert_sentence = TRUE,
  remove_equations = FALSE,
  split_pattern = "\\p{WHITE_SPACE}{3,}",
  ...
)

Arguments

pdf_text

A list of text from a PDF import, most likely from 'pdftools::pdf_text()'. Each element of the list holds the text of one page of the PDF.

split_pdf

TRUE/FALSE indicating whether to split the PDF text using white space. This is most useful with multicolumn PDF files. The split_pdf function attempts to rearrange the column layout of the text into a single column, starting with the left column and proceeding to the right. Default is FALSE.

remove_hyphen

TRUE/FALSE indicating whether words hyphenated across a line break should be rejoined so that they appear on a single line. Default is TRUE.
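
To illustrate the idea behind remove_hyphen, here is a minimal base R sketch. It is not the package's implementation. It assumes a hyphen immediately followed by a line break marks a word split across lines, and the example lines are made up.

```r
# Hypothetical lines from a PDF page with words broken across line ends
lines <- c(
  "This study examines infor-",
  "mation retrieval in PDF docu-",
  "ments."
)

# Assumption: a hyphen directly before a line break marks a split word
text <- paste(lines, collapse = "\n")
joined <- gsub("-\n", "", text, fixed = TRUE)

print(joined)
```

A rule this simple also merges genuine compounds that happen to break at the hyphen (for example "well-" / "known" becomes "wellknown"), so it is worth checking the output when exact terms matter.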

convert_sentence

TRUE/FALSE indicating if the individual lines of the PDF file should be collapsed into a single large paragraph to perform keyword searching. Default is TRUE.
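
The following base R sketch shows why collapsing lines helps keyword searching. It only illustrates the concept with made-up lines and is not the code the package runs.

```r
# Hypothetical lines of text from one PDF page
lines <- c(
  "Keyword searching works best",
  "when a phrase is not split",
  "across lines."
)

# Collapse the lines into one paragraph, separated by single spaces
paragraph <- paste(lines, collapse = " ")

# A phrase spanning a line break is only found in the collapsed text
found_in_lines <- any(grepl("split across lines", lines, fixed = TRUE))
found_in_paragraph <- grepl("split across lines", paragraph, fixed = TRUE)

print(c(lines = found_in_lines, paragraph = found_in_paragraph))
```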

remove_equations

TRUE/FALSE indicating if equations should be removed. Default is FALSE. The behavior is to search for the following regex: "\([0-9]{1,}\)$". This matches a literal opening parenthesis, followed by at least one digit, followed by a closing parenthesis at the end of the text line, such as an equation number like (12). It will not detect other patterns, and it will not detect the entire equation if the equation spans multiple rows.
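
As a quick check of what the equation regex matches, the sketch below applies it to a few made-up lines with base R's grepl(). The backslashes are doubled because the pattern is written as an R string. Dropping the matching lines is an assumption made for illustration and may differ from what the package does internally.

```r
# Hypothetical lines of text, two of which end in an equation number
lines <- c(
  "The model is defined as",
  "y = a + b * x        (1)",
  "where a is the intercept (see Section 2).",
  "z = x^2 + y^2       (12)"
)

# Regex from the remove_equations description, escaped for an R string
eq_pattern <- "\\([0-9]{1,}\\)$"

# TRUE for lines that end in a parenthesized number
is_equation <- grepl(eq_pattern, lines)

# Assumption for illustration: matching lines are simply dropped
kept <- lines[!is_equation]

print(is_equation)
print(kept)
```

The third line is kept because its closing parenthesis is followed by a period and its parentheses do not contain only digits. The first line, which introduces the equation, is also kept, which shows the limitation with surrounding or multi-row equation text.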

split_pattern

Regular expression pattern used to split multicolumn PDF files using stringi::stri_split_regex. Default pattern is "\\p{WHITE_SPACE}{3,}", which can be interpreted as: split based on three or more consecutive white space characters.
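
The sketch below applies the default split_pattern with stringi::stri_split_regex to two made-up lines from a two-column page. Reassembling the pieces left column first is a simplified assumption about how the column layout might be recreated. It is not the package's code and it assumes every line contains exactly two columns.

```r
# Hypothetical lines from a two-column page; columns are separated by
# runs of three or more spaces
page_lines <- c(
  "Left one      Right one",
  "Left two      Right two"
)

# Split each line on three or more consecutive white space characters
pieces <- stringi::stri_split_regex(page_lines, "\\p{WHITE_SPACE}{3,}")

# Assumption: each line yields exactly two pieces (left and right column)
left <- vapply(pieces, `[`, character(1), 1)
right <- vapply(pieces, `[`, character(1), 2)

# Read the left column top to bottom, then the right column
single_column <- c(left, right)

print(single_column)
```

Single or double spaces inside a column are left alone because the pattern requires at least three white space characters. A narrower column gap would need a smaller count in split_pattern.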

...

Additional arguments, currently not used.


lebebr01/pdfsearch documentation built on July 17, 2022, 7:02 a.m.