knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This is a package where I have collected some of the functions I use when dealing with data.
library(xutils)
html_decode
Currently, there is only one function, html_decode, which replaces HTML entities such as &amp;amp; with the characters they stand for, such as &amp;.
This function is a thin wrapper of C++ code provided by Christoph on Stack Overflow.
An example would be
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
html_decode(strings)
It works very well!
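To make concrete what "decoding" means here, the following is a minimal base-R sketch. The helper `decode_basic_entities` is hypothetical and written only for illustration: it is not how xutils::html_decode is implemented, and it handles just five named entities.

```r
# Hypothetical helper for illustration only; not the implementation behind
# xutils::html_decode. It covers only five named entities.
decode_basic_entities <- function(x) {
  entities <- c(
    "&lt;"   = "<",
    "&gt;"   = ">",
    "&quot;" = "\"",
    "&apos;" = "'",
    "&amp;"  = "&"
  )
  # "&amp;" is handled last so that "&amp;lt;" decodes to "&lt;" and not "<".
  for (entity in names(entities)) {
    x <- gsub(entity, entities[[entity]], x, fixed = TRUE)
  }
  x
}

decode_basic_entities(c("abcd", "&amp; &apos; &gt;", "&amp;lt;"))
```

A lookup like this makes one pass over the whole vector per entity and knows nothing about numeric references such as &#8364;, which is part of why a dedicated decoder is worth having.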
To the best of my knowledge, there are already several solutions to this problem, so why wrap up a new function? Because of performance.
First, there is an existing package, textutils, that contains many functions for dealing with data. The one of interest here is HTMLdecode.
Second, there is a function by Stack Overflow user Stibu (here) that uses the xml2 package. The function is:
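unescape_html2 <- function(str){
  # Join all strings with a separator and wrap them in one element,
  # so the whole vector is parsed in a single call
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  # Split the decoded text back into one element per input string
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}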
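The function above relies on joining the whole vector with a separator and splitting it again after parsing. The sketch below isolates that round trip in base R, with made-up input values, to show why `fixed = TRUE` is needed. It assumes no input string contains the separator itself.

```r
# Illustrative values only; "#_|" is the separator used in unescape_html2.
sep <- "#_|"
strings <- c("a", "b c", "d")

# Collapse the vector into a single string.
joined <- paste0(strings, collapse = sep)
joined

# fixed = TRUE matters: read as a regular expression, "#_|" would mean
# "'#_' or the empty string", because "|" is the alternation operator.
parts <- strsplit(joined, sep, fixed = TRUE)[[1]]
identical(parts, strings)
```

One caveat of this approach: strsplit drops a trailing empty piece, so a vector whose last element is an empty string would come back one element shorter.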
Third, I took the code from Christoph (here) and wrote an R wrapper for the C function. This function is xutils::html_decode.
Now, let's test the performance. I use the bench package here.
bench::mark(
  html_decode(strings),
  unescape_html2(strings),
  textutils::HTMLdecode(strings)
)
Clearly, html_decode is the fastest of the three functions.
Note:
When testing the results, I discovered a bug in textutils::HTMLdecode and reported it here. The maintainer fixed it immediately. As of this writing (Feb. 16, 2021), the development version of textutils has this bug fixed, but the CRAN version may not. This means that if you test the performance yourself with an earlier version of textutils, you may run into an error; installing the development version will resolve it.
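Before rerunning the comparison, it may help to see which version of textutils is installed. The helper below is a hypothetical convenience written for this note. It only reports the version; it cannot tell whether that version contains the fix, since the fixed version number is not stated here.

```r
# Hypothetical helper: describe the installed textutils version, if any,
# without attaching the package.
textutils_status <- function() {
  if (requireNamespace("textutils", quietly = TRUE)) {
    version <- as.character(utils::packageVersion("textutils"))
    paste("textutils", version, "is installed")
  } else {
    "textutils is not installed"
  }
}

textutils_status()
```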