jericho: Break Down the Walls of 'HTML' Tags into Usable Text


Description

Structured 'HTML' content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain "just the text" from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the 'Jericho HTML Parser' Java library by Martin Jericho (http://jericho.htmlparser.net/docs/index.html). Martin's library is used in many at-scale projects, including 'The Internet Archive'.

Author(s)

Bob Rudis (bob@rud.is)


hrbrmstr/jericho documentation built on May 14, 2019, 9:35 a.m.