```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
gttoolkit is a library where I collect custom functions for personal workflows. As such, it will continue to evolve as I encounter problems that other libraries cannot solve.
Because gttoolkit is an eternal work-in-progress, there is no released version. Use the development version instead:
```r
devtools::install_github("gershomtripp/gttoolkit")
```
Here's a summary of the current tools:
- `parse_table()`: extracts not only the text of HTML tables (`rvest::html_table()` already does this), but also HTML element attributes (`href`, `style`, etc.), element names (`th`, `td`, `div`), raw HTML, etc.
- `rows_to_names()`: similar to `janitor::row_to_names()`, except that it allows the user to extract and concatenate headers and sub-headers, while also allowing some preprocessing of header rows via `name_fn`. This last feature is particularly important when using `parse_table()`, which returns a data frame of list columns.

An issue with `rvest::html_table()` is that it doesn't offer any straightforward way to extract anything other than text from HTML tables, despite much useful information living in the HTML element attributes. `parse_table()` solves this by letting the user pass in a function to target specific data. By default, `parse_table()` just extracts text:
```r
library(gttoolkit)
library(rvest)
library(dplyr)

url <- "https://en.wikipedia.org/wiki/Political_party_strength_in_Hawaii"

wiki_table <- read_html(url) %>%
  html_element(".wikitable")

hawaii_text_df <- parse_table(wiki_table) %>%
  mutate(across(.fns = unlist))

hawaii_text_df %>%
  tail()
```
For performance reasons, `parse_table()` doesn't attempt to remove white space, which would require regular expressions. Most data needs further cleaning anyway, even after the white space has been removed, so it makes sense to defer these tasks to a step farther down the pipeline.
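For instance, once the columns have been unlisted as above, a typical downstream cleaning step might collapse runs of white space with a regular expression. A minimal sketch (`hawaii_text_clean` is just an illustrative name):

```r
# Collapse runs of white space into single spaces and trim both ends.
hawaii_text_clean <- hawaii_text_df %>%
  mutate(across(everything(), ~ gsub("\\s+", " ", trimws(.x))))
```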
It's also worth noting that `parse_table()` always takes an object of class `xml_nodeset` as input and returns a data frame of list columns, which is essentially a list of lists, as output. Beyond being faster, this also gives the user the ability to extract anything that a list can contain, including vectors of n > 1, named vectors, lists, data frames, etc.
```r
hawaii_links_df <- parse_table(
  wiki_table,
  cell_fn = function(x) {
    html_elements(x, "a") %>%
      html_attr("href") %>%
      `names<-`(html_text2(html_elements(x, "a")))
  }
)

# Taking a look at the contents of column V10, we see that the table contains
# named vectors of n > 1.
tail(hawaii_links_df$V10)
```
The function `row_to_names()` from the janitor package allows the user to turn a single row into column names. However, many tables contain headers and sub-headers, which means `row_to_names()` often falls short. gttoolkit's `rows_to_names()` helps by allowing the user to aggregate multiple rows into one vector of headers:
```r
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3) %>%
  tail()
```
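The `name_fn` argument mentioned above can preprocess the header rows before they are concatenated. A minimal sketch, assuming `name_fn` is simply applied to the header cells before merging (see `?rows_to_names` for the exact semantics):

```r
# Assumption for illustration: name_fn is applied to the header cells before
# the rows are concatenated into names, e.g. to normalize their case.
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3, name_fn = tolower) %>%
  tail()
```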
The usefulness of `rows_to_names()` becomes apparent when the HTML table in question has multiple headers and sub-headers scattered across the entire length of the table. For example, the table of political party strength in Hawaii has numerous divisions in a single table. To make sense of the data we should split the table, then recombine it once we've reorganized the parts.
But first, the very first row should be part of the headers of every component part. A simple call to `rows_to_names()` with its defaults will take care of this, although we should also make sure the names are unique so that the data frame plays nicely with dplyr functions:
```r
library(purrr)

hawaii_text_df <- hawaii_text_df %>%
  rows_to_names() %>%
  set_names(make.unique(names(.), "...")) # Make names unique.

hawaii_text_df %>%
  head()
```
We can use `parse_table()` again to extract HTML element names, which can tell us where the headers are so we can split the data frame appropriately:
```r
name_props <- as.matrix(parse_table(wiki_table, html_name)) %>%
  {rowMeans(. == "th", na.rm = TRUE)}
```
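Each entry of `name_props` is the proportion of cells in a row that are `th` elements, so rows made up (almost) entirely of `th` cells mark the header rows:

```r
# Positions of the header rows: rows that are (almost) all "th" cells.
which(name_props > .9)
```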
Using `name_props`, we can create groups based on the location of the first sub-header of each sub-header group. This is necessary because the sub-headers themselves have sub-headers; if we don't find the first sub-header of each group, we will end up with empty data frames:
```r
hawaii_text_df_list <- hawaii_text_df %>%
  mutate(
    th = name_props[-1] > .9, # [-1] because the first row is now the names.
    # The first header of each group is a "th" row without a "th" before it.
    group = map2_lgl(th, lag(th), ~ isTRUE(.x) && !isTRUE(.y)) %>%
      cumsum()
  ) %>%
  group_split(group)
```
We can now iterate over our list of data frames and complete the headers by merging the remaining "th" rows with the ones we already have. The last element of the list is empty because of headers at the bottom of the original table, so we can drop it before iterating over the rest. While we're at it, we can also rename the column containing years and convert its values to integers:
```r
hawaii_text_df_list <- hawaii_text_df_list[-length(hawaii_text_df_list)] %>%
  map(
    ~ rows_to_names(., row_nums = which(.$th), merge_current = TRUE) %>%
      rename_with(~ ifelse(grepl("Year", ., ignore.case = TRUE), "year", .)) %>%
      mutate(year = as.integer(year)) %>%
      # Remove our "th" and "group" helper columns to avoid confusion.
      select(-starts_with(c("th", "group")))
  )

# Let's take a peek at one table in the list.
head(hawaii_text_df_list[[1]])
```
Now that we've aggregated our headers, it's fairly trivial to pivot to a "long" data frame and separate out the data found in the headers into variables:
```r
library(tidyr)

hawaii_df_long <- map_dfr(
  hawaii_text_df_list,
  pivot_longer,
  cols = -year,
  names_to = "branch_form_office",
  values_to = "person_party"
) %>%
  separate(branch_form_office, c("branch", "form", "office"), sep = "<<=>>") %>%
  mutate(branch = sub("\\.{3}\\d{1,}", "", branch)) # Remove "unique" suffixes.

tail(hawaii_df_long)
```
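With the data in this tidy form, ordinary dplyr verbs work as expected. A quick illustration (not part of the package):

```r
# Count how many records each branch/office combination has.
hawaii_df_long %>%
  count(branch, office, sort = TRUE) %>%
  head()
```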
gttoolkit is freely available to anyone who might find it useful. That said, the package is a catch-all for functions I've developed to surmount hurdles in my own workflow, so don't expect too much support. I also can't guarantee that it will work on every system, though I do try to keep things as system-agnostic as possible. Contributions are appreciated if you think you can improve something.