boilerpipeR: Interface to the boilerpipe Java library by Christian Kohlschutter (http://code.google.com/p/boilerpipe/)

Generic extraction of the main text content from HTML files: the package removes ads, sidebars, and headers using the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
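A minimal usage sketch, assuming a working Java runtime plus the rJava and boilerpipeR packages are installed. It uses the `content` example dataset and the `ArticleExtractor` function listed among the package files; the exact shape of the output is not guaranteed here.

```r
# Assumption: Java, rJava, and boilerpipeR are installed and loadable.
library(boilerpipeR)

# 'content' is the example HTML document shipped with the package
# (data/content.rda, documented in man/content.Rd).
data(content)

# Extract the main article text, dropping navigation, ads, and other boilerplate.
extract <- ArticleExtractor(content)

# Show the beginning of the extracted text.
cat(substr(extract, 1, 300), "\n")
```

The other extractors documented under man/ (for example `DefaultExtractor` or `LargestContentExtractor`) are expected to accept the same HTML string, so they can be swapped in to compare results on a given page.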

Author: Mario Annau [aut, cre]
Date of publication: 2014-05-29 13:46:03
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.2
URL: https://github.com/mannau/boilerpipeR


Files

DESCRIPTION
NAMESPACE
NEWS
R
R/Extractor.R
R/boilerpipeR-package.R
R/onload.R
build
build/vignette.rds
data
data/content.rda
inst
inst/doc
inst/doc/ShortIntro.R
inst/doc/ShortIntro.Rnw
inst/doc/ShortIntro.pdf
inst/java
inst/java/boilerpipe-1.2.0.jar
inst/java/nekohtml-1.9.13.jar
inst/java/xerces-2.9.1.jar
man
man/ArticleExtractor.Rd
man/ArticleSentencesExtractor.Rd
man/CanolaExtractor.Rd
man/DefaultExtractor.Rd
man/Extractor.Rd
man/KeepEverythingExtractor.Rd
man/LargestContentExtractor.Rd
man/NumWordsRulesExtractor.Rd
man/boilerpipeR-package.Rd
man/content.Rd
vignettes
vignettes/ShortIntro.Rnw
vignettes/figures
vignettes/figures/blogpicture.pdf
vignettes/references.bib
