View source: R/convert_tokens.r
convert_tokens — R Documentation

Description

Tokenize words from text or from a PDF file.
Usage

convert_tokens(
  x,
  path = FALSE,
  split_pdf = FALSE,
  remove_hyphen = TRUE,
  token_function = NULL
)
Arguments

x
    The text of the PDF file, or a path to the PDF file. The text can be
    supplied directly; alternatively, set path = TRUE and pass a file path,
    in which case the pdftools package is used to read the PDF.

path
    TRUE/FALSE indicating whether x is a file path to the PDF to be
    converted to text. The pdftools package is used for this conversion.
    Default is FALSE.

split_pdf
    TRUE/FALSE indicating whether to split the PDF using white space. This
    is most useful with multi-column PDF files. When TRUE, the function
    attempts to recreate the column layout as a single column of text,
    starting with the left column and proceeding to the right.

remove_hyphen
    TRUE/FALSE indicating whether words hyphenated across line breaks
    should be rejoined into a single word. Default is TRUE.

token_function
    A function from the tokenizers package. Default is tokenize_words.
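Any tokenizer from the tokenizers package can be passed through token_function. As a hedged sketch (assuming the pdfsearch and tokenizers packages are installed), the bundled example PDF can be split into sentences instead of the default words:

```r
library(pdfsearch)

# Locate the example PDF shipped with pdfsearch.
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# Tokenize into sentences rather than the default tokenize_words.
convert_tokens(file, path = TRUE,
               token_function = tokenizers::tokenize_sentences)
```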
Value

A list of character vectors containing the tokens. More detail can be found in the documentation of the tokenizers package.
Examples

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
convert_tokens(file, path = TRUE)
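Since the text can also be supplied directly (leaving path at its default of FALSE), a minimal sketch, assuming pdfsearch is installed:

```r
library(pdfsearch)

# Pass a character vector directly; no PDF reading is involved,
# so path stays at its default of FALSE.
convert_tokens("This is a simple sentence. Here is another one.")
```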