read_html: Read in .html Content

Description

Read in the content from a .html file. This is a general-purpose reader that returns all body text. For finer control, use the xml2 and rvest packages directly.
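As a rough illustration of the finer control mentioned above, the following sketch uses xml2 and rvest directly to pull out only the paragraph text. The inline HTML string and the "p" selector are made up for the example; note that xml2 also exports a function named read_html(), so the namespace is spelled out to avoid confusion with the textreadr function documented here.

```r
library(xml2)
library(rvest)

# A small inline document (hypothetical) so the example is self-contained.
doc <- xml2::read_html(
    "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
)

# Select only the <p> nodes with a CSS selector and extract their text.
paragraphs <- rvest::html_text(rvest::html_elements(doc, "p"))
paragraphs
```

The same pattern works with any CSS selector or, via xml2::xml_find_all(), with an XPath expression.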

Usage

read_html(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

read_xml(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

Arguments

file

The path to the .html file.

skip

The number of lines to skip.

remove.empty

logical. If TRUE, empty elements in the vector are removed.

trim

logical. If TRUE, the leading/trailing white space is removed.

...

Other arguments passed to xml2::read_html().

Value

Returns a character vector.
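The effect of skip, remove.empty, and trim on the returned character vector can be pictured with a small base R sketch. The helper post_process() below is hypothetical and is not the package's internal code; the order in which the steps are applied is an assumption made for illustration.

```r
# Hypothetical helper mimicking the documented arguments on a character vector.
post_process <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {
    # Drop the first `skip` elements (lines).
    if (skip > 0) x <- x[-seq_len(skip)]
    # Strip leading/trailing white space from each element.
    if (trim) x <- trimws(x)
    # Remove elements that are empty strings.
    if (remove.empty) x <- x[x != ""]
    x
}

raw <- c("header", "  first line  ", "", "second line")
cleaned <- post_process(raw, skip = 1)
cleaned
```

With all three options disabled (skip = 0, remove.empty = FALSE, trim = FALSE), the helper returns its input unchanged.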

References

The xpath is taken from Tony Breyal's response on StackOverflow: https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page/3195926#3195926
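To give a sense of the kind of XPath the reference describes, here is a minimal sketch using xml2 that selects text nodes while excluding those inside script and style elements. The exact expression used by the package is not shown on this page, so the XPath below is an assumption in the spirit of the referenced answer, and the inline HTML is invented for the example.

```r
library(xml2)

# Hypothetical document containing visible text plus script/style content.
html <- paste0(
    "<html><head><style>p {color: red;}</style></head>",
    "<body><p>Visible text</p><script>var x = 1;</script></body></html>"
)
doc <- xml2::read_html(html)

# Keep text nodes that have no <script> or <style> ancestor (assumed XPath).
nodes <- xml2::xml_find_all(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)]"
)
txt <- xml2::xml_text(nodes)
txt
```

Further predicates such as [not(ancestor::noscript)] can be chained in the same way to exclude other elements.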

Examples

html_dat <- read_html(
    system.file("docs/textreadr_creed.html", package = "textreadr")
)

## Not run: 
url <- "http://www.talkstats.com/index.php"
file <- download(url)
(txt <- read_html(url))
(txt <- read_html(file))

## End(Not run)

textreadr documentation built on Oct. 9, 2021, 5:06 p.m.