boilerpipeR: Interface to the Boilerpipe Java Library
Version 1.3

Generic extraction of the main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.

Author: See AUTHORS file.
Date of publication: 2015-05-11 00:20:25
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.3
URL: https://github.com/mannau/boilerpipeR
Package repository: CRAN
Installation: Install the latest version of this package by entering the following in R:
install.packages("boilerpipeR")
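
Once the package is installed, a first extraction could look like the minimal sketch below. It assumes a working Java runtime is available (the package bundles the boilerpipe JAR files listed under Files), that the bundled `content` dataset is a character string of HTML, and that the extractor functions accept that string as their first argument. The variable names `article_text` and `default_text` are illustrative only.

```r
# Minimal sketch; assumes boilerpipeR is installed and Java is available.
library(boilerpipeR)

# Load the sample web page shipped with the package (data/content.rda).
data(content)

# Extractor tuned towards news articles.
article_text <- ArticleExtractor(content)

# A more generic full-text extractor, for comparison.
default_text <- DefaultExtractor(content)

# Print the first 300 characters of the article extraction.
cat(substr(article_text, 1, 300), "\n")
```

Comparing the output of two extractors on the same page is a quick way to see which heuristic suits a given site template.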


Man pages

ArticleExtractor: A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting...
boilerpipeR-package: Extract the main content from HTML files
CanolaExtractor: A full-text extractor trained on a 'krdwrd' Canola (see...
content: Wordpress generated Webpage (retrieved from Quantivity Blog...
DefaultExtractor: A quite generic full-text extractor.
Extractor: Generic extraction function which calls boilerpipe extractors
KeepEverythingExtractor: Marks everything as content.
LargestContentExtractor: A full-text extractor which extracts the largest text...
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the...

Functions

ArticleExtractor (man page, source code)
ArticleSentencesExtractor (man page, source code)
CanolaExtractor (man page, source code)
DefaultExtractor (man page, source code)
Extractor (man page, source code)
KeepEverythingExtractor (man page, source code)
LargestContentExtractor (man page, source code)
NumWordsRulesExtractor (man page, source code)
boilerpipe (man page)
boilerpipeR-package (man page)
content (man page)
onLoad (source code)

Files

inst
inst/NEWS.Rd
inst/AUTHORS
inst/java
inst/java/xerces-2.9.1.jar
inst/java/nekohtml-1.9.13.jar
inst/java/boilerpipe-1.2.0.jar
inst/doc
inst/doc/ShortIntro.Rnw
inst/doc/ShortIntro.pdf
inst/doc/ShortIntro.R
NAMESPACE
data
data/content.rda
R
R/onload.R
R/boilerpipeR-package.R
R/Extractor.R
vignettes
vignettes/figures
vignettes/figures/blogpicture.pdf
vignettes/ShortIntro.Rnw
vignettes/references.bib
MD5
java
java/README
build
build/vignette.rds
DESCRIPTION
man
man/content.Rd
man/Extractor.Rd
man/ArticleExtractor.Rd
man/CanolaExtractor.Rd
man/KeepEverythingExtractor.Rd
man/boilerpipeR-package.Rd
man/LargestContentExtractor.Rd
man/ArticleSentencesExtractor.Rd
man/DefaultExtractor.Rd
man/NumWordsRulesExtractor.Rd
boilerpipeR documentation built on May 19, 2017, 8:27 a.m.
