read_html: Read in .html Content
In textreadr: Read Text Documents into R

Read in the content from a .html file. This is a general-purpose reader that returns all of the body text. For finer control, use the xml2 and rvest packages directly.

read_html(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

read_xml(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

`file`	The path to the .html file.
`skip`	The number of lines to skip.
`remove.empty`	logical. If `TRUE` empty elements in the vector are removed.
`trim`	logical. If `TRUE` the leading/trailing white space is removed.
`...`	Other arguments passed to xml2::read_html().

Returns a character vector.

The xpath is taken from Tony Breyal's response on StackOverflow: https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page/3195926#3195926

html_dat <- read_html(
    system.file("docs/textreadr_creed.html", package = "textreadr")
)

## Not run: 
url <- "http://www.talkstats.com/index.php"
file <- download(url)
(txt <- read_html(url))
(txt <- read_html(file))

## End(Not run)
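
For the finer control mentioned in the description, text nodes can be selected with xml2 directly. The following is a minimal sketch, not textreadr's internal implementation: the inline HTML string is a made-up example, and the XPath is an assumption modeled on the idea in the StackOverflow answer cited above (keep text nodes, skip those inside `script` and `style`). It assumes the xml2 package is installed.

```r
# Call xml2 through its namespace so textreadr::read_html() is not picked up
# by accident when both packages are attached.

# Hypothetical inline document, so the sketch needs no file or network access
page <- xml2::read_html(
    "<html><head><style>p {color: red;}</style></head>
     <body><h1>Title</h1><p> First paragraph. </p>
     <script>var x = 1;</script><p></p><p>Second paragraph.</p></body></html>"
)

# Assumed XPath: every text node that is not inside <script> or <style>
nodes <- xml2::xml_find_all(
    page,
    "//text()[not(ancestor::script)][not(ancestor::style)]"
)

txt <- trimws(xml2::xml_text(nodes))  # roughly what trim = TRUE does
txt <- txt[txt != ""]                 # roughly what remove.empty = TRUE does
txt
```

Narrowing the XPath (for example to `//p//text()`) restricts the result to paragraph text only, which is the kind of control the generalized reader does not offer.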