hrbrmstr/jericho: Break Down the Walls of 'HTML' Tags into Usable Text

Structured 'HTML' content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain "just the text" from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the 'Jericho HTML Parser' Java library by Martin Jericho <http://jericho.htmlparser.net/docs/index.html>. Martin's library is used in many at-scale projects, including 'The Internet Archive'.

Package details

Maintainer: Bob Rudis <bob@rud.is>
License: Apache License 2.0 | file LICENSE
Version: 0.2.0
URL: https://gitlab.com/hrbrmstr/jericho

Installation

Install the latest version of this package by entering the following in R:

install.packages("remotes")
remotes::install_github("hrbrmstr/jericho")
hrbrmstr/jericho documentation built on May 14, 2019, 9:35 a.m.