boilerpipeR-package: Extract the main content from HTML files


Description

boilerpipeR is an interface to the boilerpipe Java library, created by Christian Kohlschütter (https://github.com/kohlschutter/boilerpipe). The library implements robust heuristics to extract the main content from HTML files, removing unnecessary elements such as ads, banners, and headers/footers.

Author(s)

Mario Annau mario.annau@gmail

See Also

Extractor, DefaultExtractor, ArticleExtractor

Examples

## Not run: 
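library(boilerpipeR)
# Load the example HTML document referenced by this help page
data(content)
# Extract the main text using the default heuristics
extract <- DefaultExtractor(content)
cat(extract)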

## End(Not run)
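
The extractors listed under See Also can also be compared on the same input. The following is a minimal sketch, not part of the package's own examples: the `html` string is a made-up page used only for illustration, and the code assumes that boilerpipeR and a working Java/rJava setup are installed. If they are not, the extraction step is skipped.

```r
# Hypothetical HTML page: navigation, one article paragraph, and a footer.
html <- paste0(
  "<html><head><title>Demo page</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<p>This paragraph stands in for the main article text. It is kept ",
  "deliberately long so that a content extractor has a reasonable amount ",
  "of running prose to work with when separating it from the surrounding ",
  "navigation links and the footer.</p>",
  "<div id='footer'>Copyright notice</div>",
  "</body></html>"
)

# Only run the extraction if the package (and its Java backend) can be loaded.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  default_text <- boilerpipeR::DefaultExtractor(html)
  article_text <- boilerpipeR::ArticleExtractor(html)

  cat("DefaultExtractor:\n", default_text, "\n", sep = "")
  cat("ArticleExtractor:\n", article_text, "\n", sep = "")
}
```

Which extractor gives the cleaner result depends on the page. ArticleExtractor is intended for news-style articles, while DefaultExtractor is a more general choice, so it can be worth trying both on a sample of the target pages.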

mannau/boilerpipeR documentation built on May 25, 2021, 10:01 a.m.