convert_tokens: Tokenize words from text or a pdf file.

View source: R/convert_tokens.r

convert_tokens    R Documentation

Tokenize words from text or a pdf file.

Description

Tokenize the words in text supplied directly or read from a pdf file.

Usage

convert_tokens(
  x,
  path = FALSE,
  split_pdf = FALSE,
  remove_hyphen = TRUE,
  token_function = NULL
)

Arguments

x

The text of the pdf file. This can be supplied directly as a character vector, or a file path can be given instead, in which case the pdftools package is used to read the pdf. To read from a file path, the path argument must be set to TRUE.

path

TRUE/FALSE indicating whether x is a file path to a pdf. If TRUE, the pdftools package is used to convert the pdf to text. Default is FALSE.
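
For example, x can be supplied as text that has already been extracted, or as a file path with path = TRUE. A minimal sketch of both modes, assuming the pdftools package is installed:

  file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
  # read the pdf yourself, then tokenize the text directly
  txt <- pdftools::pdf_text(file)
  convert_tokens(txt)
  # or hand convert_tokens the file path and let it do the reading
  convert_tokens(file, path = TRUE)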

split_pdf

TRUE/FALSE indicating whether to split the pdf text using white space. This is most useful for multicolumn pdf files. When TRUE, the function attempts to recreate the column layout of the text as a single column, starting with the left column and proceeding to the right (see the sketch below).

remove_hyphen

TRUE/FALSE indicating whether words hyphenated across line breaks should be re-joined onto a single line. Default is TRUE.
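
Both layout options can be set in a single call. A minimal sketch, assuming 'file' is a path to a multicolumn pdf:

  # split columns into reading order, keep hyphenated line-break words as-is
  convert_tokens(file, path = TRUE, split_pdf = TRUE, remove_hyphen = FALSE)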

token_function

A tokenization function from the tokenizers package. Default is the tokenize_words function.
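
Other tokenizers from the tokenizers package with the same interface should also work. A sketch swapping in sentence tokenization instead of the default word tokenization:

  convert_tokens(file, path = TRUE,
                 token_function = tokenizers::tokenize_sentences)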

Value

A list of character vectors containing the tokens. See the documentation of the tokenizers package for more detail.
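
The result can be inspected with standard list tools. A short sketch, continuing the example above:

  tokens <- convert_tokens(file, path = TRUE)
  str(tokens, max.level = 1)  # overall list structure
  head(tokens[[1]])           # first few tokens of the first element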

Examples

 file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
 convert_tokens(file, path = TRUE) 
