convert_tokens: Tokenize the text of a pdf file


View source: R/convert_tokens.r

Description

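Tokenizes the text of a pdf file. By default the text is split into words with the tokenize_words function from the tokenizers package.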

Usage

convert_tokens(x, path = FALSE, split_pdf = FALSE,
  remove_hyphen = TRUE, token_function = NULL)

Arguments

x

The text of the pdf file. This can be supplied directly as text, or as a file path, in which case the pdftools package is used to read the pdf file. To supply a file path, the path argument must be set to TRUE.

path

TRUE/FALSE indicating whether x is a file path to the pdf to be converted to text rather than the text itself. When TRUE, the pdftools package is used for this conversion. Default is FALSE.

split_pdf

TRUE/FALSE indicating whether to split the pdf using white space. This is most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text as a single column, starting with the left column and proceeding to the right. Default is FALSE.

remove_hyphen

TRUE/FALSE indicating whether words hyphenated across a line break should be rejoined into a single word. Default is TRUE.

token_function

A tokenizing function from the tokenizers package. If NULL (the default), the tokenize_words function is used.

Value

A list of character vectors containing the tokens. More detail can be found in the documentation of the tokenizers package.
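
To make the shape of the return value concrete, the sketch below calls tokenize_words from the tokenizers package directly, since this page names it as the default tokenizer. It assumes the tokenizers package is installed, and the sample sentence is made up for illustration. The output of convert_tokens on a real pdf may be organized differently (for example, one element per page), so treat this only as an illustration of "a list of character vectors".

```r
library(tokenizers)

# Hypothetical sample text, not taken from any pdf
txt <- "The quick brown fox."

# tokenize_words returns a list with one character vector per input element;
# by default it lowercases the text and strips punctuation
tokens <- tokenize_words(txt)

# Inspect the structure: a list holding a single character vector
str(tokens)
```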

Examples

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
convert_tokens(file, path = TRUE)
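
The Arguments section also describes supplying text directly and swapping the tokenizer. A minimal sketch of both follows. It assumes the pdfsearch and tokenizers packages are installed, that token_function accepts a function object such as tokenizers::tokenize_sentences, and the sample text is hypothetical. No output is shown because the exact grouping of tokens is not stated on this page.

```r
library(pdfsearch)

# Hypothetical text standing in for the contents of a pdf
txt <- "Tokenizers split text into pieces. Each piece is a token."

# Text is supplied directly, so path keeps its default of FALSE
word_tokens <- convert_tokens(txt)

# Assumption: a different tokenizers function can be passed as a function object
sentence_tokens <- convert_tokens(txt,
                                  token_function = tokenizers::tokenize_sentences)

# Both results are expected to be lists of character vectors
str(word_tokens)
str(sentence_tokens)
```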

pdfsearch documentation built on May 1, 2019, 8:01 p.m.