Generic extraction of the main text content from HTML files; removal of ads, sidebars, and headers using the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
|Author||Mario Annau [aut, cre]|
|Date of publication||2014-05-29 13:46:03|
|Maintainer||Mario Annau <email@example.com>|
|License||Apache License (== 2.0)|
ArticleExtractor: A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting...
boilerpipeR-package: Extract the main content from HTML files
CanolaExtractor: A full-text extractor trained on a krdwrd Canola.
content: Wordpress generated Webpage (retrieved from Quantivity Blog...
DefaultExtractor: A quite generic full-text extractor.
Extractor: Generic extraction function which calls boilerpipe extractors
KeepEverythingExtractor: Marks everything as content.
LargestContentExtractor: A full-text extractor which extracts the largest text...
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the...
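
The topics listed above fit together roughly as follows. This is a minimal sketch, not an authoritative reference: it assumes boilerpipeR and its Java dependency (via rJava) are installed, that the bundled `content` dataset loads as a character string of HTML, and that `Extractor` takes the extractor name as a string in its first argument.

```r
# Assumption: boilerpipeR and a working Java/rJava setup are installed.
library(boilerpipeR)

# Load the example HTML page shipped with the package (the `content` topic).
data(content)

# Run the generic full-text extractor on the HTML string.
extract <- DefaultExtractor(content)

# The same kind of call through the generic dispatcher, naming the
# extractor as a string (assumed calling convention).
extract_article <- Extractor("ArticleExtractor", content)

# Show the beginning of the extracted main text.
cat(substr(extract, 1, 300), "\n")
```

Which extractor works best depends on the page: `ArticleExtractor` is tuned towards news articles, while `DefaultExtractor` is the more generic choice and `KeepEverythingExtractor` can serve as a baseline that marks everything as content.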