delete.markup: Delete HTML or XML tags



Description

Function for removing markup tags (e.g. HTML, XML) from a string of characters. All XML markup is assumed to be compliant with the TEI guidelines (https://tei-c.org/).

Usage

delete.markup(input.text, markup.type = "plain")

Arguments

input.text

a character string, or a vector of strings, containing the markup tags to be deleted.

markup.type

one of the following values: plain (nothing is changed), html (all <tags> are deleted, along with the HTML header), xml (the TEI header, all text between <note> </note> tags, and all tags are deleted), xml.drama (as above, but speakers' names, i.e. the text within each pair of <speaker> </speaker> tags, are deleted as well), xml.notitles (as above, but chapter and section (sub)titles, i.e. the text within each pair of <head> </head> tags, are deleted as well).
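The difference between deleting a tag and deleting an element together with its content can be illustrated with a short base R sketch. This is not the code used by delete.markup; the helper name strip_sketch and its drop argument are hypothetical, introduced only to show the kind of behaviour described above for <note>, <speaker> and <head>.

```r
# Hypothetical helper, NOT stylo's implementation: a base R sketch of
# "delete some elements entirely, then delete all remaining tags".
strip_sketch <- function(text, drop = character(0)) {
  for (tag in drop) {
    # Remove the element and everything inside it, e.g. <note>...</note>;
    # (?s) lets "." match newlines, .*? keeps the match non-greedy
    pattern <- paste0("(?s)<", tag, "(\\s[^>]*)?>.*?</", tag, ">")
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  # Remove whatever tags remain, keeping the text between them
  gsub("<[^>]+>", "", text)
}

# Roughly what the description of "xml" implies: notes go, other text stays
strip_sketch("Gallia<note>Gallia: Gaul.</note> est omnis <emph>divisa</emph> in partes tres",
             drop = "note")

# Roughly what the description of "xml.drama" implies: speaker names go too
strip_sketch("<speaker>Hamlet</speaker>Words, words, words...",
             drop = c("note", "speaker"))
```

In this sketch the elements to drop are passed explicitly, so it makes no claim about which elements each markup.type value removes beyond what is stated above.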

Details

This function needs to be used with care. A document formatted in compliance with the TEI guidelines will be parsed flawlessly, but cleaning up an HTML page harvested at random from the web may have side effects: for example, footers, disclaimers, etc. will not be removed.
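Why footers survive can be seen from a minimal base R sketch of plain tag stripping. It assumes a made-up page and a generic regular expression, and is not the code used by delete.markup: the tags disappear, but the text they enclosed, including the footer, stays in the result.

```r
# Illustrative only: generic tag stripping in base R, not stylo's code.
# The page content below is invented for the example.
page <- paste0(
  "<html><body><p>Gallia est omnis divisa in partes tres.</p>",
  "<div class=\"footer\">Terms of use | Contact</div></body></html>"
)

# Delete every <...> tag but keep the text between tags
stripped <- gsub("<[^>]+>", "", page)
print(stripped)

# The footer text is still present after stripping
grepl("Terms of use", stripped, fixed = TRUE)
```

Text of this kind has to be removed separately, for instance by trimming the file by hand before it is loaded.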

Author(s)

Maciej Eder, Mike Kestemont

See Also

load.corpus, txt.to.words, txt.to.words.ext, txt.to.features

Examples

  delete.markup("Gallia est omnis <i>divisa</i> in partes tres", 
           markup.type = "html")

  delete.markup("Gallia<note>Gallia: Gaul.</note> est omnis 
           <emph>divisa</emph> in partes tres", markup.type = "xml")

  delete.markup("<speaker>Hamlet</speaker>Words, words, words...", 
           markup.type = "xml.drama")

Example output

### stylo version: 0.6.9 ###

If you plan to cite this software (please do!), use the following reference:
    Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
    a package for computational text analysis. R Journal 8(1): 107-121.
    <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>

To get full BibTeX entry, type: citation("stylo")
[1] "Gallia est omnis divisa in partes tres"
[1] "Gallia est omnis \n           divisa in partes tres"
[1] "Words, words, words..."

stylo documentation built on Dec. 6, 2020, 5:06 p.m.
