Introduction

A common use of the chunked_pmap function is to iteratively parse XML that is stored in a column within a database table. This pattern can be applied to any large query result set that requires a custom function to be applied to each row.

Parsing XML

The example below details how one would use the xml_extract function along with chunked_pmap to iteratively parse XML. Obviously the result set below will fit entirely into memory and doesn't need to be chunked. This is to demonstrate a pattern than can be used with much larger result sets.

The example below requires the DBI, dplyr, and RSQLite, packages. The function xml_extract requires purrr and xml2.

Prepare Database

The code below prepares an RSQLite in-memory database based on the mtcars dataset. The code also will create a new column, xml_col, that will be used to demonstrate XML parsing.

library("CRAmisc")
suppressMessages(library("dplyr"))

# function to create an XML document
xml_create <- function(car, hp, cyl) {
  xml_doc <- paste0("<root><car>",
                    car,
                    "</car><stats hp=\"",
                    hp,
                    "\" cyl=\"",
                    cyl,
                    "\"></stats></root>")
  gsub("[\r\n]", "", xml_doc)
}

mtcars_with_xml <- mtcars %>%
  mutate(car = attr(., "row.names"),
         xml_col = xml_create(car = car,
                              hp = hp,
                              cyl = cyl)) %>%
  select(car, mpg, hp, cyl, xml_col)  # only keep 5 columns

# in memory SQLite database
dbcon <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
invisible(DBI::dbWriteTable(dbcon, "mtcars_with_xml", mtcars_with_xml))

Create Query and Send to Database

Next, create a query and send to the database. The function DBI::dbSendQuery will submit a synchronous query to be run on the database and will block until it has completed. But the result set is left on the database and is not brought back until DBI::dbFetch is called.

sql_query <- "SELECT * FROM mtcars_with_xml where hp > 100"
res <- DBI::dbSendQuery(dbcon, sql_query)

Write Custom Parsing Function

The function that is passed to chunked_pmap should take in a dataframe and return a transformed dataframe. The function below makes use of the xml_extract function which is just a convenience wrapper around the parsing functions found in the xml2 package. In addition, xml_extract makes use of purrr::safely to silence parsing errors. This is useful when working with large result sets.

# f should accept a dataframe and return a dataframe
f <- function(df) {
  df %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      xml_car = xml_extract(
        x = xml_col,
        xpath = "//root/car",
        extract_type = "text"
      ),
      xml_hp = xml_extract(
        x = xml_col,
        xpath = "//root/stats",
        extract_type = "attr",
        extract_value = "hp"
      ),
      xml_cyl = xml_extract(
        x = xml_col,
        xpath = "//root/stats",
        extract_type = "attr",
        extract_value = "cyl"
      )
    )
}

Iterate Over Chunks

Invoke chunked_pmap to iteratively parse the XML document within each row of the dataframe.

# process 5 rows at a time
mtcars_with_xml_parsed <- chunked_pmap(res, f, 5) %>%
  dplyr::select(-hp, -mpg, -cyl)
knitr::kable(mtcars_with_xml_parsed, format = "html", padding = 2)

Cleanup

Close the database connection. The result set does not need to be cleared (DBI::dbClearResult(res)) as that occurs within chunked_pmap.

DBI::dbDisconnect(dbcon)


curtisalexander/CRAmisc documentation built on May 14, 2019, 12:52 p.m.