extractContentDOM: Extract Main HTML Content from DOM

View source: R/extract.R

Description

Extracts the main HTML content of a page using its Document Object Model (DOM). The approach rests on the observation that the main content of an HTML document usually sits in a subnode of the DOM tree with a high text-to-tag ratio. Internally, this function calls assignValues, calcDensity, getMainText and removeTags.
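The text-to-tag ratio idea can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper text_to_tag_ratio and the two sample fragments are hypothetical, and tags are counted with a simple regular expression instead of walking a parsed DOM tree as assignValues and calcDensity presumably do.

```r
# Hypothetical helper (not part of tm.plugin.webmining):
# number of visible text characters per tag in an HTML fragment.
text_to_tag_ratio <- function(html) {
  # Locate every tag-like span such as <div class="x"> or </p>
  matches <- gregexpr("<[^>]+>", html)[[1]]
  n_tags <- if (matches[1] == -1) 0L else length(matches)
  # Replace tags with spaces, then normalise whitespace to approximate the visible text
  text <- trimws(gsub("<[^>]+>", " ", html))
  text <- gsub("\\s+", " ", text)
  # Guard against division by zero for fragments without tags
  nchar(text) / max(n_tags, 1L)
}

# Assumed sample fragments: a link-heavy navigation list and a text-heavy article body
nav <- "<ul><li><a href='/'>Home</a></li><li><a href='/about'>About</a></li></ul>"
article <- "<div><p>The main content of a page usually consists of long runs of text with only a few tags around it.</p></div>"

ratios <- c(nav = text_to_tag_ratio(nav), article = text_to_tag_ratio(article))

# The fragment with the highest ratio is the candidate for main content
main_block <- names(ratios)[which.max(ratios)]
print(ratios)
print(main_block)
```

The navigation fragment has many tags and little text, so its ratio is low, while the article fragment scores much higher. A threshold such as the one passed to extractContentDOM would then decide which subnodes count as content, though how the package applies it exactly is not stated here.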

Usage

extractContentDOM(url, threshold, asText = TRUE, ...)

Arguments

url

character; a URL or filename, or the HTML content itself when asText = TRUE

threshold

threshold for extraction, defaults to 0.5

asText

logical; specifies whether url should be interpreted as a character string of HTML content rather than as a URL or filename

...

Additional parameters passed to htmlTreeParse

Author(s)

Mario Annau

References

http://www.elias.cn/En/ExtMainText

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

Gupta et al., DOM-based Content Extraction of HTML Documents, http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html

See Also

xmlNode


mannau/tm.plugin.webmining documentation built on May 21, 2019, 11:24 a.m.