boilerpipeR: Interface to the Boilerpipe Java Library

Generic extraction of the main text content from HTML files: ads, sidebars and headers are removed using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of web site templates.

Author
See AUTHORS file.
Date of publication
2015-05-11 00:20:25
Maintainer
Mario Annau <mario.annau@gmail.com>
License
Apache License (== 2.0)
Version
1.3

Man pages

ArticleExtractor
A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor
A full-text extractor which is tuned towards extracting...
boilerpipeR-package
Extract the main content from HTML files
CanolaExtractor
A full-text extractor trained on a 'krdwrd' Canola (see...
content
Wordpress generated Webpage (retrieved from Quantivity Blog...
DefaultExtractor
A quite generic full-text extractor.
Extractor
Generic extraction function which calls boilerpipe extractors
KeepEverythingExtractor
Marks everything as content.
LargestContentExtractor
A full-text extractor which extracts the largest text...
NumWordsRulesExtractor
A quite generic full-text extractor solely based upon the...
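
The extractors listed above can be tried on the `content` example page that ships with the package. The following is a minimal sketch, not taken from the package documentation verbatim: it assumes boilerpipeR, its rJava dependency and a Java runtime are installed, and the variable names `article` and `generic` are only illustrative.

```r
# Minimal sketch: run two of the listed extractors on the bundled example page.
# Assumption: boilerpipeR (plus rJava and a Java runtime) is installed.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)

  # 'content' is the example HTML page shipped as package data
  data(content)

  # Extractor tuned towards news articles
  article <- ArticleExtractor(content)

  # A more generic full-text extractor, for comparison
  generic <- DefaultExtractor(content)

  # Show the beginning of the extracted article text
  cat(substr(article, 1, 300), "\n")
}
```

Which extractor works best depends on the page template, so comparing the output of two or more extractors on a sample of pages is a reasonable first step.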

Files in this package

boilerpipeR
boilerpipeR/inst
boilerpipeR/inst/NEWS.Rd
boilerpipeR/inst/AUTHORS
boilerpipeR/inst/java
boilerpipeR/inst/java/xerces-2.9.1.jar
boilerpipeR/inst/java/nekohtml-1.9.13.jar
boilerpipeR/inst/java/boilerpipe-1.2.0.jar
boilerpipeR/inst/doc
boilerpipeR/inst/doc/ShortIntro.Rnw
boilerpipeR/inst/doc/ShortIntro.pdf
boilerpipeR/inst/doc/ShortIntro.R
boilerpipeR/NAMESPACE
boilerpipeR/data
boilerpipeR/data/content.rda
boilerpipeR/R
boilerpipeR/R/onload.R
boilerpipeR/R/boilerpipeR-package.R
boilerpipeR/R/Extractor.R
boilerpipeR/vignettes
boilerpipeR/vignettes/figures
boilerpipeR/vignettes/figures/blogpicture.pdf
boilerpipeR/vignettes/ShortIntro.Rnw
boilerpipeR/vignettes/references.bib
boilerpipeR/MD5
boilerpipeR/java
boilerpipeR/java/README
boilerpipeR/build
boilerpipeR/build/vignette.rds
boilerpipeR/DESCRIPTION
boilerpipeR/man
boilerpipeR/man/content.Rd
boilerpipeR/man/Extractor.Rd
boilerpipeR/man/ArticleExtractor.Rd
boilerpipeR/man/CanolaExtractor.Rd
boilerpipeR/man/KeepEverythingExtractor.Rd
boilerpipeR/man/boilerpipeR-package.Rd
boilerpipeR/man/LargestContentExtractor.Rd
boilerpipeR/man/ArticleSentencesExtractor.Rd
boilerpipeR/man/DefaultExtractor.Rd
boilerpipeR/man/NumWordsRulesExtractor.Rd