```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
gttoolkit is a library where I collect custom functions for personal workflows. As such, it will continue to evolve as I encounter problems that other libraries cannot solve.
Because gttoolkit is an eternal work-in-progress, there is no released version. Use the development version instead:
```r
devtools::install_github("gershomtripp/gttoolkit")
```
Here's a summary of the current tools:
- `parse_table()`: extracts not only the text of HTML tables (`rvest::html_table()` already does this), but also HTML element attributes (`href`, `style`, etc.), element names (`th`, `td`, `div`), raw HTML, etc.
- `rows_to_names()`: similar to `janitor::row_to_names()`, except that it allows the user to extract and concatenate headers and sub-headers, while also allowing some preprocessing of header rows via `name_fn`. This last feature is particularly important when using `parse_table()`, which returns a data frame of list columns.

An issue with `rvest::html_table()` is that it doesn't offer any straightforward way to extract anything other than text from HTML tables, despite much useful information living in the HTML element attributes. `parse_table()` solves this by letting the user pass in a function to target specific data. By default, `parse_table()` just extracts text:
```r
library(gttoolkit)
library(rvest)
library(dplyr)

url <- "https://en.wikipedia.org/wiki/Political_party_strength_in_Hawaii"

wiki_table <- read_html(url) %>%
  html_element(".wikitable")

hawaii_text_df <- parse_table(wiki_table) %>%
  mutate(across(.fns = unlist))

hawaii_text_df %>%
  tail()
```
For performance reasons, `parse_table()` doesn't attempt to remove white space, which would require regular expressions. Most data needs further cleaning anyway, even after the white space has been removed, so it makes sense to defer these tasks to a step farther down the pipeline.
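For instance, once the columns have been unlisted as above, a typical downstream cleaning step might collapse runs of white space with a regular expression. A minimal sketch (`hawaii_text_clean` is just an illustrative name):

```r
# Collapse runs of white space into single spaces and trim both ends.
hawaii_text_clean <- hawaii_text_df %>%
  mutate(across(everything(), ~ gsub("\\s+", " ", trimws(.x))))
```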
It's also worth noting that `parse_table()` always takes an object of class `xml_nodeset` as input and returns a data frame of list columns, which is essentially a list of lists, as output. Beyond being faster, this also gives the user the ability to extract anything that a list can contain, including vectors of n > 1, named vectors, lists, data frames, etc.
```r
hawaii_links_df <- parse_table(
  wiki_table,
  cell_fn = function(x) {
    html_elements(x, "a") %>%
      html_attr("href") %>%
      `names<-`(html_text2(html_elements(x, "a")))
  }
)

# Taking a look at the contents of column V10, we see that the table contains
# named vectors of n > 1.
tail(hawaii_links_df$V10)
```
The function `row_to_names()` from the janitor package allows the user to turn a single row into column names. However, many tables contain headers and sub-headers, which means `row_to_names()` often falls short. gttoolkit's `rows_to_names()` helps by allowing the user to aggregate multiple rows into one vector of headers:
```r
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3) %>%
  tail()
```
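The `name_fn` argument mentioned above can preprocess the header rows before they are concatenated. A minimal sketch, assuming `name_fn` is simply applied to the header cells before merging (see `?rows_to_names` for the exact semantics):

```r
# Assumption for illustration: name_fn is applied to the header cells before
# the rows are concatenated into names, e.g. to normalize their case.
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3, name_fn = tolower) %>%
  tail()
```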
The usefulness of `rows_to_names()` becomes apparent when the HTML table in question has multiple headers and sub-headers scattered across the entire length of the table. For example, the table of political party strength in Hawaii has numerous divisions in a single table. To make sense of the data we should split the table, then recombine it once we've reorganized the parts.
But first, the very first row should be part of the headers of every component part. A simple call to `rows_to_names()` with its defaults will take care of this, although we should also make sure the names are unique so that the data frame plays nicely with dplyr functions:
```r
library(purrr)

hawaii_text_df <- hawaii_text_df %>%
  rows_to_names() %>%
  set_names(make.unique(names(.), "...")) # Make names unique.

hawaii_text_df %>%
  head()
```
We can use `parse_table()` again to extract HTML element names, which can tell us where the headers are so we can split the data frame appropriately:
```r
name_props <- as.matrix(parse_table(wiki_table, html_name)) %>%
  {rowMeans(. == "th", na.rm = TRUE)}
```
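Each entry of `name_props` is the proportion of cells in a row that are `th` elements, so rows made up (almost) entirely of `th` cells mark the header rows:

```r
# Positions of the header rows: rows that are (almost) all "th" cells.
which(name_props > .9)
```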
Using `name_props`, we can create groups based on the location of the first sub-header of each sub-header group. This is necessary because the sub-headers themselves have sub-headers; if we don't find the first sub-header of each group, we will end up with empty data frames:
```r
hawaii_text_df_list <- hawaii_text_df %>%
  mutate(
    th = name_props[-1] > .9, # [-1] because the first row is now the names.
    # The first header of each group is a "th" row without a "th" before it.
    group = map2_lgl(th, lag(th), ~ isTRUE(.x) && !isTRUE(.y)) %>%
      cumsum()
  ) %>%
  group_split(group)
```
We can now iterate over our list of data frames and complete the headers by merging the remaining "th" rows with the ones we already have. The last element of the list is empty because of headers at the bottom of the original table, so we can drop it before iterating over the rest. While we're at it, we can also rename the column containing years and convert its values to integers:
```r
hawaii_text_df_list <- hawaii_text_df_list[-length(hawaii_text_df_list)] %>%
  map(
    ~ rows_to_names(., row_nums = which(.$th), merge_current = TRUE) %>%
      rename_with(~ ifelse(grepl("Year", ., ignore.case = TRUE), "year", .)) %>%
      mutate(year = as.integer(year)) %>%
      # Remove our "th" and "group" helper columns to avoid confusion.
      select(-starts_with(c("th", "group")))
  )

# Let's take a peek at one table in the list.
head(hawaii_text_df_list[[1]])
```
Now that we've aggregated our headers, it's fairly trivial to pivot to a "long" data frame and separate out the data found in the headers into variables:
```r
library(tidyr)

hawaii_df_long <- map_dfr(
  hawaii_text_df_list,
  pivot_longer,
  cols = -year,
  names_to = "branch_form_office",
  values_to = "person_party"
) %>%
  separate(branch_form_office, c("branch", "form", "office"), sep = "<<=>>") %>%
  mutate(branch = sub("\\.{3}\\d{1,}", "", branch)) # Remove "unique" suffixes.

tail(hawaii_df_long)
```
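With the data in this tidy form, ordinary dplyr verbs work as expected. A quick illustration (not part of the package):

```r
# Count how many records each branch/office combination has.
hawaii_df_long %>%
  count(branch, office, sort = TRUE) %>%
  head()
```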
gttoolkit is freely available to anyone who might find it useful. That said, the package is a catch-all for functions I've developed to surmount hurdles in my own workflow, so don't expect too much support. I also can't guarantee that it will work on every system, though I do try to keep things as system-agnostic as possible. Contributions are appreciated if you think you can improve something.