A common use of the chunked_pmap
function is to iteratively parse XML that is stored in a column within a database table. This pattern can be applied to any large query result set that requires a custom function to be applied to each row.
The example below details how one would use the xml_extract
function along with chunked_pmap
to iteratively parse XML. Obviously the result set below will fit entirely into memory and doesn't need to be chunked. This is to demonstrate a pattern than can be used with much larger result sets.
The example below requires the DBI
, dplyr
, and RSQLite
, packages. The function xml_extract
requires purrr
and xml2
.
The code below prepares an RSQLite
in-memory database based on the mtcars
dataset. The code also will create a new column, xml_col
, that will be used to demonstrate XML parsing.
library("CRAmisc") suppressMessages(library("dplyr")) # function to create an XML document xml_create <- function(car, hp, cyl) { xml_doc <- paste0("<root><car>", car, "</car><stats hp=\"", hp, "\" cyl=\"", cyl, "\"></stats></root>") gsub("[\r\n]", "", xml_doc) } mtcars_with_xml <- mtcars %>% mutate(car = attr(., "row.names"), xml_col = xml_create(car = car, hp = hp, cyl = cyl)) %>% select(car, mpg, hp, cyl, xml_col) # only keep 5 columns # in memory SQLite database dbcon <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:") invisible(DBI::dbWriteTable(dbcon, "mtcars_with_xml", mtcars_with_xml))
Next, create a query and send to the database. The function DBI::dbSendQuery
will submit a synchronous query to be run on the database and will block until it has completed. But the result set is left on the database and is not brought back until DBI::dbFetch
is called.
sql_query <- "SELECT * FROM mtcars_with_xml where hp > 100" res <- DBI::dbSendQuery(dbcon, sql_query)
The function that is passed to chunked_pmap
should take in a dataframe and return a transformed dataframe. The function below makes use of the xml_extract
function which is just a convenience wrapper around the parsing functions found in the xml2
package. In addition, xml_extract
makes use of purrr::safely
to silence parsing errors. This is useful when working with large result sets.
# f should accept a dataframe and return a dataframe f <- function(df) { df %>% dplyr::rowwise() %>% dplyr::mutate( xml_car = xml_extract( x = xml_col, xpath = "//root/car", extract_type = "text" ), xml_hp = xml_extract( x = xml_col, xpath = "//root/stats", extract_type = "attr", extract_value = "hp" ), xml_cyl = xml_extract( x = xml_col, xpath = "//root/stats", extract_type = "attr", extract_value = "cyl" ) ) }
Invoke chunked_pmap
to iteratively parse the XML document within each row of the dataframe.
# process 5 rows at a time mtcars_with_xml_parsed <- chunked_pmap(res, f, 5) %>% dplyr::select(-hp, -mpg, -cyl) knitr::kable(mtcars_with_xml_parsed, format = "html", padding = 2)
Close the database connection. The result set does not need to be cleared (DBI::dbClearResult(res)
) as that occurs within chunked_pmap
.
DBI::dbDisconnect(dbcon)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.