boilerpipeR: Interface to the boilerpipe Java library by Christian Kohlschütter (http://code.google.com/p/boilerpipe/)

Generic extraction of the main text content from HTML files, removing ads, sidebars, and headers with the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.

Author: Mario Annau [aut, cre]
Date of publication: 2014-05-29 13:46:03
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.2

Man pages

ArticleExtractor: A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting...
boilerpipeR-package: Extract the main content from HTML files
CanolaExtractor: A full-text extractor trained on a krdwrd Canola.
content: WordPress-generated webpage (retrieved from Quantivity Blog...
DefaultExtractor: A quite generic full-text extractor.
Extractor: Generic extraction function which calls boilerpipe extractors
KeepEverythingExtractor: Marks everything as content.
LargestContentExtractor: A full-text extractor which extracts the largest text...
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the...
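
The extractors listed above are called on a string of HTML. The following is a minimal sketch rather than an authoritative reference: it assumes boilerpipeR and rJava are installed with a working Java runtime, that the bundled content dataset is a character string of HTML, and that each extractor accepts that string as its first argument. The variable names article_text and default_text are illustrative only.

```r
# Sketch only: assumes boilerpipeR, rJava, and a Java runtime are available.
library(boilerpipeR)

# Load the example dataset shipped with the package
# (assumed to be a single character string containing raw HTML).
data(content)

# Run two of the listed extractors on the same HTML input.
article_text <- ArticleExtractor(content)
default_text <- DefaultExtractor(content)

# Compare how much text each extractor keeps relative to the raw page.
nchar(content)
nchar(article_text)
nchar(default_text)

# Print the beginning of the extracted article text.
cat(substr(article_text, 1, 300), "\n")
```

Running several extractors on the same page and comparing the results is a simple way to choose one for a given site template; KeepEverythingExtractor can serve as a baseline because it marks everything as content.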

Files in this package

boilerpipeR/DESCRIPTION
boilerpipeR/NAMESPACE
boilerpipeR/NEWS
boilerpipeR/R
boilerpipeR/R/Extractor.R
boilerpipeR/R/boilerpipeR-package.R
boilerpipeR/R/onload.R
boilerpipeR/build
boilerpipeR/build/vignette.rds
boilerpipeR/data
boilerpipeR/data/content.rda
boilerpipeR/inst
boilerpipeR/inst/doc
boilerpipeR/inst/doc/ShortIntro.R
boilerpipeR/inst/doc/ShortIntro.Rnw
boilerpipeR/inst/doc/ShortIntro.pdf
boilerpipeR/inst/java
boilerpipeR/inst/java/boilerpipe-1.2.0.jar
boilerpipeR/inst/java/nekohtml-1.9.13.jar
boilerpipeR/inst/java/xerces-2.9.1.jar
boilerpipeR/man
boilerpipeR/man/ArticleExtractor.Rd
boilerpipeR/man/ArticleSentencesExtractor.Rd
boilerpipeR/man/CanolaExtractor.Rd
boilerpipeR/man/DefaultExtractor.Rd
boilerpipeR/man/Extractor.Rd
boilerpipeR/man/KeepEverythingExtractor.Rd
boilerpipeR/man/LargestContentExtractor.Rd
boilerpipeR/man/NumWordsRulesExtractor.Rd
boilerpipeR/man/boilerpipeR-package.Rd
boilerpipeR/man/content.Rd
boilerpipeR/vignettes
boilerpipeR/vignettes/ShortIntro.Rnw
boilerpipeR/vignettes/figures
boilerpipeR/vignettes/figures/blogpicture.pdf
boilerpipeR/vignettes/references.bib