unescape_markup: Clean up xml or html markup tags and formatting

View source: R/unescape_markup.R

unescape_markup R Documentation

Clean up xml or html markup tags and formatting

Description

This is a minor modification of the approach described at http://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r; all credit is due to that answer.

This function calls either xml2::read_xml() or xml2::read_html(), depending on the value passed to what_ml. The default, if not specified, is html.
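
The parsing step can be illustrated with a minimal sketch. It assumes the xml2 package is installed; unescape_one is a hypothetical helper name used only for illustration and is not the package's actual implementation. The input is wrapped in a dummy <x> element, parsed, and its text content extracted.

```r
# Minimal sketch of the parse-then-extract idea; requires the xml2 package.
# `unescape_one` is a hypothetical name, not a function exported by the package.
unescape_one <- function(s, what_ml = c("html", "xml")) {
  what_ml <- match.arg(what_ml)
  # Wrap the input so the parser always sees literal markup
  wrapped <- paste0("<x>", s, "</x>")
  doc <- switch(what_ml,
    html = xml2::read_html(wrapped),
    xml  = xml2::read_xml(wrapped)
  )
  # Text content of the parsed document: tags dropped, entities decoded
  xml2::xml_text(doc)
}

if (requireNamespace("xml2", quietly = TRUE)) {
  print(unescape_one("<i>in-situ</i> electron microscopy"))
  print(unescape_one("a &amp; b"))
  print(unescape_one("a &lt; b", what_ml = "xml"))
}
```

Note that the XML parser is stricter than the HTML parser: HTML-only named entities are generally not defined in plain XML, so the choice of what_ml matters for such input.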

If called with iconv_encoding = TRUE, x is processed by iconv(), which may or may not change x. To minimize surprises, and because the function may return early when no unescaping is required, iconv_encoding is FALSE by default; in that case, any arguments that would be passed to iconv() via ... are ignored.
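
To show why iconv() "may or may not change x", here is a small base R sketch. The from and to encodings are assumptions chosen for illustration; they stand in for whatever would be supplied through ... in a real call.

```r
# Base R only. The encodings below are illustrative assumptions.
ascii_only   <- "plain ASCII text"
latin1_bytes <- "caf\xe9"  # byte 0xE9 is e-acute in Latin-1

same    <- iconv(ascii_only,   from = "latin1", to = "UTF-8")
changed <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")

# ASCII input is left untouched by the conversion
identical(same, ascii_only)

# Non-ASCII input is re-encoded, so the underlying bytes differ
identical(charToRaw(changed), charToRaw(latin1_bytes))
```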

Usage

unescape_markup(x, what_ml = c("html", "xml"), iconv_encoding = FALSE, ...)

Arguments

x

A character vector; the input you wish to unescape

what_ml

One of "html" or "xml", denoting how the content should be parsed. Defaults to "html".

iconv_encoding

A logical vector of length 1. Should the input be processed via iconv?

...

Optional. Additional arguments passed to iconv(); used only when iconv_encoding is TRUE.

Details

Useful when dealing with '< >' enclosed parts of strings in a vector

Value

A character vector the same length as x, with its markup unescaped. If no unescaping was required, x is returned as is, by default (that is, with iconv_encoding = FALSE).

Note

The xml2 functions this relies upon are not vectorized (they target a different use case, so no criticism of the functions themselves is implied). This function handles inputs of length > 1 through vapply(), and keeps performance reasonable by first subsetting to only those elements of x where <.+> is present. Runtime therefore grows with the number of elements that need unescaping, and not solely as a function of length(x).
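
The subset-then-vapply() pattern described in this note can be sketched in base R. The names unescape_where_needed and strip_tags are hypothetical, and strip_tags is only a regex stand-in for the xml2-based per-element step so that the sketch runs without extra packages.

```r
# Sketch of the "only touch elements that need it" pattern.
# `unescape_where_needed` and `strip_tags` are hypothetical names.
unescape_where_needed <- function(x, unescape_one) {
  stopifnot(is.character(x), is.function(unescape_one))
  # Flag elements that contain something enclosed in < >
  needs <- grepl("<.+>", x)
  # Early return: nothing to do, so hand x back unchanged
  if (!any(needs)) return(x)
  out <- x
  # Apply the non-vectorized worker only to the flagged elements
  out[needs] <- vapply(x[needs], unescape_one, character(1), USE.NAMES = FALSE)
  out
}

# Stand-in worker (an assumption): drops tags with a regex instead of parsing
strip_tags <- function(s) gsub("<[^>]+>", "", s)

x <- c("plain text", "<i>in-situ</i> electron microscopy", "no markup here")
unescape_where_needed(x, strip_tags)
```

Because the worker is only called on flagged elements, a long vector with few marked-up entries costs little more than the initial grepl() pass.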

Examples

x <- "<i>in-situ</i> electron microscopy"
unescape_markup(x)

slin30/wzMisc documentation built on Jan. 27, 2023, 1 a.m.