Delete HTML or XML tags

Description

Function for removing markup tags (e.g. HTML, XML) from a string of characters. All XML markup is assumed to be compliant with the TEI guidelines (http://www.tei-c.org/).

Usage

delete.markup(input.text, markup.type = "plain")

Arguments

input.text

a character string or character vector containing the markup tags to be deleted.

markup.type

one of the following values: "plain" (the text is left unchanged); "html" (all <tags> are deleted, as well as the HTML header); "xml" (the TEI header, all strings between <note> </note> tags, and all the tags themselves are deleted); "xml.drama" (as above, but speakers' names, i.e. the strings within each pair of <speaker> </speaker> tags, are deleted as well); "xml.notitles" (as above, but all chapter/section (sub)titles, i.e. the strings within each pair of <head> </head> tags, are deleted as well).
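
The difference between deleting tags and deleting whole elements can be illustrated with a rough base-R sketch. This is not the code used by delete.markup: the helper name strip.tags.sketch and its drop.elements argument are hypothetical, and the regular expressions are a simplification that assumes well-formed, non-nested elements.

```r
# Hypothetical helper, NOT part of the package: a simplified illustration
# of removing some elements entirely and then stripping all remaining tags.
strip.tags.sketch <- function(text, drop.elements = character(0)) {
  for (el in drop.elements) {
    # (?s) lets "." match newlines; ".*?" is lazy, so each element is
    # matched separately rather than from the first opening tag to the
    # last closing tag
    pattern <- paste0("(?s)<", el, "\\b[^>]*>.*?</", el, ">")
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  # remove every remaining tag, keeping the text between tags
  gsub("<[^>]*>", "", text)
}

# the note is dropped with its content; <emph> loses only its tags
strip.tags.sketch(
  "Gallia<note>Gallia: Gaul.</note> est omnis <emph>divisa</emph> in partes tres",
  drop.elements = "note")
# "Gallia est omnis divisa in partes tres"

# dropping <speaker> elements as well, as described for "xml.drama"
strip.tags.sketch("<speaker>Hamlet</speaker>Words, words, words...",
                  drop.elements = c("note", "speaker"))
# "Words, words, words..."
```

The sketch ignores the TEI header and any whitespace handling, both of which the real function may treat differently.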

Details

Use this function with care: a document formatted in compliance with the TEI guidelines should be cleaned reliably, but an HTML page harvested from the web may retain unwanted text. Only the tags and the HTML header are deleted, so footers, disclaimers, etc. remain in the output.
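
Why footer text survives can be seen in a minimal base-R sketch. It uses a plain gsub() call as a stand-in for tag deletion, which is an assumption for illustration and not the package's own code; the sample page is invented.

```r
# Invented sample page: one paragraph of text followed by a footer
page <- paste0("<html><body><p>Gallia est omnis divisa in partes tres.</p>",
               "<div class='footer'>Terms of use</div></body></html>")

# Stand-in for tag deletion: replace every tag with a space,
# then collapse runs of whitespace and trim the ends
stripped <- gsub("<[^>]*>", " ", page)
stripped <- trimws(gsub("[[:space:]]+", " ", stripped))

stripped
# "Gallia est omnis divisa in partes tres. Terms of use"
```

The footer text stays because it is ordinary character data between tags; nothing in the markup distinguishes it from the body text, so it has to be removed by hand or by a site-specific rule.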

Author(s)

Maciej Eder, Mike Kestemont

See Also

load.corpus, txt.to.words, txt.to.words.ext, txt.to.features

Examples

  library(stylo)

  # delete HTML tags
  delete.markup("Gallia est omnis <i>divisa</i> in partes tres",
           markup.type = "html")

  # delete the <note> element with its content, then all remaining tags
  delete.markup("Gallia<note>Gallia: Gaul.</note> est omnis <emph>divisa</emph> in partes tres",
           markup.type = "xml")

  # additionally delete the speaker's name
  delete.markup("<speaker>Hamlet</speaker>Words, words, words...",
           markup.type = "xml.drama")
