Extracts the main content of an HTML document using its Document Object Model (DOM).
The approach rests on the observation that the main content of an HTML document
usually sits in a subnode of the DOM tree with a high text-to-tag ratio.
Internally, this function also calls assignValues, calcDensity, getMainText and removeTags.
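The text-to-tag ratio heuristic described above can be illustrated with a minimal base R sketch. The helper `text_to_tag_ratio` and the two sample fragments are hypothetical and are not part of the documented function or its internals; the sketch counts tags with a regular expression rather than walking a parsed DOM tree, so it only approximates the idea.

```r
# Hypothetical helper for illustration only (not part of the package):
# number of non-whitespace text characters per tag in an HTML fragment.
text_to_tag_ratio <- function(html) {
  # Locate every tag-like substring such as <p>, </a> or <a href='/'>
  tags <- gregexpr("<[^>]+>", html)[[1]]
  n_tags <- if (tags[1] == -1) 0 else length(tags)

  # Strip the tags, then drop whitespace to count only visible characters
  text <- gsub("<[^>]+>", "", html)
  n_chars <- nchar(gsub("[[:space:]]+", "", text))

  # Guard against division by zero for fragments without any tags
  n_chars / max(n_tags, 1)
}

# Navigation markup: many tags, little text
nav <- "<ul><li><a href='/'>Home</a></li><li><a href='/about'>About</a></li></ul>"

# Article markup: few tags, much text
article <- "<div><p>This paragraph carries the actual article text and has few tags.</p></div>"

text_to_tag_ratio(nav)
text_to_tag_ratio(article)
```

The article fragment scores well above the navigation fragment, which is the kind of difference a density threshold can separate.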
extractContentDOM(url, threshold, asText = TRUE, ...)
url | character, URL or filename
threshold | threshold for extraction, defaults to 0.5
asText | logical, specifies whether url should be interpreted as a character string
... | additional parameters to
Mario Annau
http://www.elias.cn/En/ExtMainText
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
Gupta et al., DOM-based Content Extraction of HTML Documents, http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html