hrbrmstr/jericho: Break Down the Walls of 'HTML' Tags into Usable Text

Structured 'HTML' content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain "just the text" from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the 'Jericho HTML Parser' Java library by Martin Jericho <http://jericho.htmlparser.net/docs/index.html>. Martin's library is used in many at-scale projects, including the 'Internet Archive'.

Package details

Maintainer: Bob Rudis <[email protected]>
License: Apache License 2.0 | file LICENSE
Version: 0.2.0
URL: https://github.com/hrbrmstr/jericho
Installation

Install the latest version of this package by entering the following in R:
install.packages("devtools")
library(devtools)
install_github("hrbrmstr/jericho")
hrbrmstr/jericho documentation built on Sept. 6, 2017, 4:30 p.m.