mannau/boilerpipeR: Interface to the Boilerpipe Java Library

Generic extraction of the main text content from HTML files, removing ads, sidebars, and headers, using the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. Boilerpipe's extraction heuristics perform robustly across a wide range of website templates.

Package details

Author: See AUTHORS file.
Maintainer: Mario Annau <mario.annau@gmail.com>
License: Apache License (== 2.0)
Version: 1.3.2
URL: https://github.com/mannau/boilerpipeR
Installation

Install the latest version of this package by entering the following in R:
install.packages("remotes")
remotes::install_github("mannau/boilerpipeR")
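
Once the package is installed, a minimal usage sketch might look like the following. It assumes a working Java runtime is available, since the package calls the boilerpipe Java library, and it uses a small hypothetical HTML string in place of a real downloaded page. ArticleExtractor() is one of the extractor functions exported by the package; the `html` and `main_text` names are illustrative only.

```r
# Assumes boilerpipeR is installed and a Java runtime is available.
library(boilerpipeR)

# Hypothetical page: navigation and footer blocks surround the article text.
html <- paste(
  "<html><head><title>Example</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='article'><h1>Example headline</h1>",
  "<p>This is the first paragraph of the main article. It contains several",
  "full sentences so that the extraction heuristics have enough running",
  "text to treat the block as content rather than boilerplate.</p>",
  "<p>A second paragraph adds more running text to the article body.</p></div>",
  "<div id='footer'>Copyright Example Inc.</div>",
  "</body></html>",
  sep = "\n"
)

# Pass the HTML as a character string; the extracted text comes back as a string.
main_text <- ArticleExtractor(html)
cat(main_text)
```

How much surrounding markup is dropped depends on boilerpipe's heuristics and on the chosen extractor, so results on very short pages such as this one may differ from results on real articles.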
mannau/boilerpipeR documentation built on May 25, 2021, 10:01 a.m.