read_document: Generic Function to Read in a Document

Description

Generic function to read in a .pdf, .txt, .html, .rtf, .docx, or .doc file.
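As a rough illustration of what "generic" means here, the sketch below maps a file extension to the name of a reader listed under the ... argument. The helper pick_reader is hypothetical and is not part of textreadr; the pairing of .txt with readLines is an assumption, and .rtf is left unmapped because this page does not name a reader for it.

```r
# Hypothetical helper, not part of textreadr: choose a reader name
# from the file extension (case-insensitive).
pick_reader <- function(file) {
    ext <- tolower(tools::file_ext(file))
    switch(ext,
        pdf  = "read_pdf",
        html = "read_html",
        docx = "read_docx",
        doc  = "read_doc",
        txt  = "readLines",   # assumption: plain text goes to readLines
        NA_character_         # anything else (including rtf) is unmapped here
    )
}

pick_reader("docs/report.PDF")
pick_reader("notes.txt")
```

The real function may resolve formats differently (for example by inspecting file contents or URLs); this only shows the extension-based idea.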

Usage

read_document(file, skip = 0, remove.empty = TRUE, trim = TRUE,
  combine = FALSE, format = FALSE, ocr = TRUE, ...)

Arguments

file

The path to a .pdf, .txt, .html, .rtf, .docx, or .doc file.

skip

The number of lines to skip.

remove.empty

logical. If TRUE empty elements in the vector are removed.

trim

logical. If TRUE the leading/trailing white space is removed.

combine

logical. If TRUE the vector is concatenated into a single string via combine.

format

For .doc files only. Logical. If TRUE the output will keep doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword.

ocr

logical. If TRUE, .pdf documents for which pdf_text returns no usable text are re-run using OCR via the ocr function. This creates temporary .png files and takes considerably longer to compute.

...

Other arguments passed to read_pdf, read_html, read_docx, read_doc, or readLines.
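The effect of skip, remove.empty, trim, and combine can be pictured with a small base R sketch operating on a character vector of lines. The function clean_lines is a hypothetical stand-in, not textreadr's implementation; the order of the steps and the single-space separator used for combine are assumptions.

```r
# Hypothetical stand-in for the cleanup arguments; not textreadr's code.
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE,
    combine = FALSE) {
    if (skip > 0) x <- x[-seq_len(skip)]         # drop the first `skip` lines
    if (trim) x <- trimws(x)                     # strip leading/trailing white space
    if (remove.empty) x <- x[nzchar(x)]          # drop zero-length elements
    if (combine) x <- paste(x, collapse = " ")   # assumption: join with one space
    x
}

raw <- c("Title", "", "  first line  ", "second line", "   ")

clean_lines(raw, skip = 1)
clean_lines(raw, combine = TRUE)
```

With the defaults, the white-space-only element is trimmed to an empty string and then removed along with the blank line.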

Value

Returns a list of string vectors.

Examples

## .pdf
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf",
    package = "textreadr")
read_document(pdf_doc)

## .html
html_doc <- system.file("docs/textreadr_creed.html", package = "textreadr")
read_document(html_doc)

## .docx
docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx",
    package = "textreadr")
read_document(docx_doc)

## .doc
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc",
    package = "textreadr")
read_document(doc_doc)

## .txt
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
read_document(txt_doc)

## .rtf
## Not run: 
rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)
read_document(rtf_doc)

## End(Not run)

## Not run: 
## URLs
read_document('http://www.talkstats.com/index.php')

## End(Not run)

textreadr documentation built on Sept. 28, 2018, 5:09 p.m.