boilerpipeR: Interface to the boilerpipe Java library by Christian Kohlschütter (http://code.google.com/p/boilerpipe/)

Generic extraction of the main text content from HTML files: the package removes ads, sidebars, and headers using the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
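
The extraction described above can be sketched in a few lines of R. This is a minimal, hedged example rather than the package's documented workflow: it assumes boilerpipeR is installed together with a working Java runtime, and that ArticleExtractor() and the bundled content dataset (both named in the file listing on this page) accept and return plain character data.

```r
# Assumption: boilerpipeR is installed and a Java runtime is available.
library(boilerpipeR)

# Load the example HTML document shipped with the package (data/content.rda).
data(content)

# Extract the main article text, dropping ads, sidebars, and headers.
extract <- ArticleExtractor(content)

# Print the extracted text.
cat(extract)
```

The other extractors in the package (for example DefaultExtractor or LargestContentExtractor) appear to follow the same calling pattern, so swapping the function name is a quick way to compare heuristics on the same page.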

Author: Mario Annau [aut, cre]
Date of publication: 2014-05-29 13:46:03
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.2
URL: https://github.com/mannau/boilerpipeR

Files in this package

boilerpipeR/DESCRIPTION
boilerpipeR/NAMESPACE
boilerpipeR/NEWS
boilerpipeR/R
boilerpipeR/R/Extractor.R
boilerpipeR/R/boilerpipeR-package.R
boilerpipeR/R/onload.R
boilerpipeR/build
boilerpipeR/build/vignette.rds
boilerpipeR/data
boilerpipeR/data/content.rda
boilerpipeR/inst
boilerpipeR/inst/doc
boilerpipeR/inst/doc/ShortIntro.R
boilerpipeR/inst/doc/ShortIntro.Rnw
boilerpipeR/inst/doc/ShortIntro.pdf
boilerpipeR/inst/java
boilerpipeR/inst/java/boilerpipe-1.2.0.jar
boilerpipeR/inst/java/nekohtml-1.9.13.jar
boilerpipeR/inst/java/xerces-2.9.1.jar
boilerpipeR/man
boilerpipeR/man/ArticleExtractor.Rd
boilerpipeR/man/ArticleSentencesExtractor.Rd
boilerpipeR/man/CanolaExtractor.Rd
boilerpipeR/man/DefaultExtractor.Rd
boilerpipeR/man/Extractor.Rd
boilerpipeR/man/KeepEverythingExtractor.Rd
boilerpipeR/man/LargestContentExtractor.Rd
boilerpipeR/man/NumWordsRulesExtractor.Rd
boilerpipeR/man/boilerpipeR-package.Rd
boilerpipeR/man/content.Rd
boilerpipeR/vignettes
boilerpipeR/vignettes/ShortIntro.Rnw
boilerpipeR/vignettes/figures
boilerpipeR/vignettes/figures/blogpicture.pdf
boilerpipeR/vignettes/references.bib
