boilerpipeR: Interface to the Boilerpipe Java Library

Generic extraction of the main text content from HTML files, with removal of ads, sidebars, and headers, using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
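
As a rough illustration of the description above, the following minimal sketch runs one of the extractors documented under man/ on the sample data shipped in data/content.rda. It assumes the package is installed together with a working Java runtime (the package bundles Java .jar files), and that data(content) loads an object named content holding raw HTML as a character string. Neither assumption is stated on this page, so treat this as a sketch rather than the package's documented workflow.

```r
# Minimal sketch; assumes boilerpipeR and a Java runtime are available.
library(boilerpipeR)

# Load the sample data listed under data/content.rda.
# Assumption: this creates an object called `content` containing raw HTML.
data(content)

# DefaultExtractor is one of the extractor functions listed under man/.
# It is expected to return the main text with boilerplate removed.
extract <- DefaultExtractor(content)

# Print the extracted text.
cat(extract)
```

The other extractors in the man/ listing (for example ArticleExtractor or LargestContentExtractor) appear to follow the same pattern, so swapping the function name is a simple way to compare their results on the same page.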

Author: See AUTHORS file.
Date of publication: 2015-05-11 00:20:25
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.3
https://github.com/mannau/boilerpipeR

Files in this package

boilerpipeR
boilerpipeR/inst
boilerpipeR/inst/NEWS.Rd
boilerpipeR/inst/AUTHORS
boilerpipeR/inst/java
boilerpipeR/inst/java/xerces-2.9.1.jar
boilerpipeR/inst/java/nekohtml-1.9.13.jar
boilerpipeR/inst/java/boilerpipe-1.2.0.jar
boilerpipeR/inst/doc
boilerpipeR/inst/doc/ShortIntro.Rnw
boilerpipeR/inst/doc/ShortIntro.pdf
boilerpipeR/inst/doc/ShortIntro.R
boilerpipeR/NAMESPACE
boilerpipeR/data
boilerpipeR/data/content.rda
boilerpipeR/R
boilerpipeR/R/onload.R
boilerpipeR/R/boilerpipeR-package.R
boilerpipeR/R/Extractor.R
boilerpipeR/vignettes
boilerpipeR/vignettes/figures
boilerpipeR/vignettes/figures/blogpicture.pdf
boilerpipeR/vignettes/ShortIntro.Rnw
boilerpipeR/vignettes/references.bib
boilerpipeR/MD5
boilerpipeR/java
boilerpipeR/java/README
boilerpipeR/build
boilerpipeR/build/vignette.rds
boilerpipeR/DESCRIPTION
boilerpipeR/man
boilerpipeR/man/content.Rd
boilerpipeR/man/Extractor.Rd
boilerpipeR/man/ArticleExtractor.Rd
boilerpipeR/man/CanolaExtractor.Rd
boilerpipeR/man/KeepEverythingExtractor.Rd
boilerpipeR/man/boilerpipeR-package.Rd
boilerpipeR/man/LargestContentExtractor.Rd
boilerpipeR/man/ArticleSentencesExtractor.Rd
boilerpipeR/man/DefaultExtractor.Rd
boilerpipeR/man/NumWordsRulesExtractor.Rd
