gttoolkit is a library where I collect custom functions for personal workflows. As such, it will continue to evolve as I encounter problems that other libraries cannot solve.
Because gttoolkit is an eternal work-in-progress, there is no released version. Install the development version instead:
devtools::install_github("gershomtripp/gttoolkit")
Here’s a summary of the current tools:
- parse_table: parses an HTML table into a data frame of list columns. It can extract not only cell text (rvest::html_table already does this), but also HTML element attributes (href, style, etc.), element names (th, td, div), raw HTML, etc.
- rows_to_names: works like janitor::row_to_names, except that it allows the user to extract and concatenate headers and sub-headers, while also allowing some preprocessing of header rows via name_fn. This last feature is particularly important when using parse_table, which returns a data frame of list columns.

An issue with rvest::html_table is that it doesn’t offer any straightforward method to extract anything other than text from HTML tables, despite much useful information being stored in HTML element attributes. parse_table solves this by allowing the user to pass in a function that targets specific data. By default parse_table just extracts text:
library(gttoolkit)
library(rvest)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
url <- "https://en.wikipedia.org/wiki/Political_party_strength_in_Hawaii"
wiki_table <- read_html(url) %>%
html_element(".wikitable")
hawaii_text_df <- parse_table(wiki_table) %>%
mutate(across(.fns = unlist))
hawaii_text_df %>%
tail()
#> V1 V2 V3 V4
#> 244 2018\n David Ige (D)\n Doug Chin (D)[31] 25D\n
#> 245 2019\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 246 2020\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 247 2021\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 248 Year Governor Lieutenant Governor State Senate
#> 249 Year Executive offices Executive offices State Legislature
#> V5 V6 V7
#> 244 46D, 5R\n Mazie Hirono (D)\n \n
#> 245 46D, 5R\n Mazie Hirono (D)\n \n
#> 246 46D, 5R\n Mazie Hirono (D)\n \n
#> 247 47D, 4R\n Mazie Hirono (D)\n \n
#> 248 State House U.S. Senator (Class I) U.S. Senator (Class III)
#> 249 State Legislature U.S. Congress U.S. Congress
#> V8 V9
#> 244 Colleen Hanabusa (D)\n Tulsi Gabbard (D)\n
#> 245 Ed Case (D)\n Tulsi Gabbard (D)\n
#> 246 Ed Case (D)\n Tulsi Gabbard (D)\n
#> 247 Ed Case (D)\n Kai Kahele (D)\n
#> 248 U.S. House District 1 U.S. House District 2
#> 249 U.S. Congress U.S. Congress
#> V10
#> 244 Hillary Clinton and Tim Kaine (D) N\n
#> 245 Hillary Clinton and Tim Kaine (D) N\n
#> 246 Joe Biden and Kamala Harris (D) Y\n
#> 247 Joe Biden and Kamala Harris (D) Y\n
#> 248 Presidential Elections\n
#> 249 Electoral College votes\n
For performance reasons parse_table doesn’t attempt to remove white space, which would require regular expressions. Most data needs further cleaning anyway, even after the white space has been removed, so it makes sense to defer these tasks to a step farther down the pipeline.
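For example, a deferred cleanup could be as simple as the following sketch. This is just an illustration, not part of the package; hawaii_text_clean is a hypothetical name, and the bracketed footnote markers are stripped purely for demonstration.

# A minimal cleanup sketch: remove footnote markers like "[31]" and trim
# surrounding white space in every column (all columns are character here).
hawaii_text_clean <- hawaii_text_df %>%
  mutate(across(everything(), ~ trimws(gsub("\\[\\d+\\]", "", .x))))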
It’s also worth noting that parse_table always takes an object of class “xml_nodeset” as input and returns a data frame of list columns, which is essentially a list of lists, as output. Beyond being faster, this gives the user the ability to extract anything a list can contain, including vectors of n > 1, named vectors, lists, data frames, etc.
hawaii_links_df <-
  parse_table(
    wiki_table,
    cell_fn = function(x) {
      html_elements(x, "a") %>%
        html_attr("href") %>%
        `names<-`(html_text2(html_elements(x, "a")))
    }
  )
# Taking a look at the contents of column V10, we see that the table contains
# named vectors of n > 1.
tail(hawaii_links_df$V10)
#> [[1]]
#> Hillary Clinton Tim Kaine
#> "/wiki/Hillary_Clinton" "/wiki/Tim_Kaine"
#>
#> [[2]]
#> Hillary Clinton Tim Kaine
#> "/wiki/Hillary_Clinton" "/wiki/Tim_Kaine"
#>
#> [[3]]
#> Joe Biden Kamala Harris
#> "/wiki/Joe_Biden" "/wiki/Kamala_Harris"
#>
#> [[4]]
#> Joe Biden Kamala Harris
#> "/wiki/Joe_Biden" "/wiki/Kamala_Harris"
#>
#> [[5]]
#> Presidential Elections
#> "/wiki/United_States_presidential_election"
#>
#> [[6]]
#> Electoral College votes
#> "/wiki/United_States_presidential_election"
The function row_to_names from the janitor library allows the user to turn a single row into column names. However, many tables contain headers and sub-headers, which means row_to_names often falls short. gttoolkit’s rows_to_names helps by allowing the user to aggregate multiple rows into one vector of headers:
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3) %>%
  tail()
#> Year<<=>>Year<<=>>Year Executive offices<<=>>Kingdom<<=>>Monarch
#> 244 2018\n David Ige (D)\n
#> 245 2019\n David Ige (D)\n
#> 246 2020\n David Ige (D)\n
#> 247 2021\n David Ige (D)\n
#> 248 Year Governor
#> 249 Year Executive offices
#> Executive offices<<=>>Kingdom<<=>>Prime Minister
#> 244 Doug Chin (D)[31]
#> 245 Josh Green (D)\n
#> 246 Josh Green (D)\n
#> 247 Josh Green (D)\n
#> 248 Lieutenant Governor
#> 249 Executive offices
#> State Legislature<<=>>Kingdom<<=>>House of Nobles
#> 244 25D\n
#> 245 24D, 1R\n
#> 246 24D, 1R\n
#> 247 24D, 1R\n
#> 248 State Senate
#> 249 State Legislature
#> State Legislature<<=>>Kingdom<<=>>House of Rep.
#> 244 46D, 5R\n
#> 245 46D, 5R\n
#> 246 46D, 5R\n
#> 247 47D, 4R\n
#> 248 State House
#> 249 State Legislature
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Mazie Hirono (D)\n
#> 245 Mazie Hirono (D)\n
#> 246 Mazie Hirono (D)\n
#> 247 Mazie Hirono (D)\n
#> 248 U.S. Senator (Class I)
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 \n
#> 245 \n
#> 246 \n
#> 247 \n
#> 248 U.S. Senator (Class III)
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Colleen Hanabusa (D)\n
#> 245 Ed Case (D)\n
#> 246 Ed Case (D)\n
#> 247 Ed Case (D)\n
#> 248 U.S. House District 1
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Tulsi Gabbard (D)\n
#> 245 Tulsi Gabbard (D)\n
#> 246 Tulsi Gabbard (D)\n
#> 247 Kai Kahele (D)\n
#> 248 U.S. House District 2
#> 249 U.S. Congress
#> Electoral College votes\n<<=>>Royal Elections\n<<=>>Royal Elections\n
#> 244 Hillary Clinton and Tim Kaine (D) N\n
#> 245 Hillary Clinton and Tim Kaine (D) N\n
#> 246 Joe Biden and Kamala Harris (D) Y\n
#> 247 Joe Biden and Kamala Harris (D) Y\n
#> 248 Presidential Elections\n
#> 249 Electoral College votes\n
The usefulness of rows_to_names becomes apparent when the HTML table in question has multiple headers and sub-headers scattered across the entire length of the table. For example, the table of Political party strength in Hawaii has numerous divisions within a single table. To make sense of the data we should split the table, then recombine it once we’ve reorganized the parts.
But first, the very first row should become part of the headers of every component part. A simple call to rows_to_names with its defaults will take care of this, although we should also make the names unique so that the data frame plays nicely with dplyr functions:
library(purrr)
hawaii_text_df <-
  hawaii_text_df %>%
  rows_to_names() %>%
  set_names(make.unique(names(.), "...")) # Make names unique
hawaii_text_df %>%
  head()
#> Year Executive offices Executive offices...1 State Legislature
#> 2 Year Kingdom Kingdom Kingdom
#> 3 Year Monarch Prime Minister House of Nobles
#> 4 1810\n Kamehameha (NP)\n no such office\n no such offices\n
#> 5 1811\n Kamehameha (NP)\n no such office\n no such offices\n
#> 6 1812\n Kamehameha (NP)\n no such office\n no such offices\n
#> 7 1813\n Kamehameha (NP)\n no such office\n no such offices\n
#> State Legislature...1 State Department State Department...1
#> 2 Kingdom Kingdom Kingdom
#> 3 House of Rep. Agent\n Agent\n
#> 4 no such offices\n no such office\n no such office\n
#> 5 no such offices\n no such office\n no such office\n
#> 6 no such offices\n no such office\n no such office\n
#> 7 no such offices\n no such office\n no such office\n
#> State Department...2 State Department...3 Electoral College votes\n
#> 2 Kingdom Kingdom Royal Elections\n
#> 3 Agent\n Agent\n Royal Elections\n
#> 4 no such office\n no such office\n \n
#> 5 no such office\n no such office\n \n
#> 6 no such office\n no such office\n \n
#> 7 no such office\n no such office\n \n
We can use parse_table
again to extract HTML element names, which can
tell us where the headers are so we can split the data frame
appropriately:
name_props <- as.matrix(parse_table(wiki_table, html_name)) %>%
  {rowMeans(. == "th", na.rm = TRUE)}
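To get a feel for where the header rows sit, we can peek at the proportions directly. This inspection is just for orientation and isn’t required by the rest of the pipeline:

# Rows made up (almost) entirely of "th" cells are header rows.
which(name_props > .9)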
Using name_props, we can create groups based on the location of the first sub-header row of each sub-header group. This is necessary because the sub-headers themselves have sub-headers; if we don’t anchor each group at its first sub-header row, we will end up with empty data frames:
hawaii_text_df_list <- hawaii_text_df %>%
  mutate(
    th = name_props[-1] > .9, # [-1] because the first row is now the names.
    # The first header of each group is the "th" row without a "th" before it.
    group = map2_lgl(th, lag(th), ~ isTRUE(.x) && !isTRUE(.y)) %>%
      cumsum()
  ) %>%
  group_split(group)
We can now iterate over our list of data frames and complete our headers by merging the remaining “th” rows with our current headers. The last element of the list is empty because of headers at the bottom of the original table, so we can drop it before iterating over the rest. While we’re at it, we can also rename the column containing years and convert its values to integers:
hawaii_text_df_list <-
  hawaii_text_df_list[-length(hawaii_text_df_list)] %>%
  map(~ rows_to_names(., row_nums = which(.$th), merge_current = TRUE) %>%
        rename_with(~ ifelse(grepl("Year", ., ignore.case = TRUE), "year", .)) %>%
        mutate(year = as.integer(year)) %>%
        select(-starts_with(c("th", "group"))) # Remove our "th" data to avoid confusion.
  )
# Let's take a peek at one table in the list.
head(hawaii_text_df_list[[1]])
#> # A tibble: 6 x 10
#> year `Executive offic… `Executive office… `State Legislatu… `State Legislatu…
#> <int> <chr> <chr> <chr> <chr>
#> 1 1810 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 2 1811 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 3 1812 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 4 1813 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 5 1814 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 6 1815 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> # … with 5 more variables: State Department<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...1<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...2<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...3<<=>>Kingdom<<=>>Agent <chr>, Electoral College votes
#> # <<=>>Royal Elections <<=>>Royal Elections <chr>
Now that we’ve aggregated our headers, it’s fairly trivial to pivot to a “long” data frame and separate out the data found in the headers into variables:
library(tidyr)
hawaii_df_long <-
  map_dfr(
    hawaii_text_df_list,
    pivot_longer,
    cols = -year,
    names_to = "branch_form_office",
    values_to = "person_party"
  ) %>%
  separate(branch_form_office, c("branch", "form", "office"), sep = "<<=>>") %>%
  mutate(branch = sub("\\.{3}\\d{1,}", "", branch)) # Remove "unique" suffixes.
tail(hawaii_df_long)
#> # A tibble: 6 x 5
#> year branch form office person_party
#> <int> <chr> <chr> <chr> <chr>
#> 1 2021 "State Legislatur… "State\… "State House" "47D, 4R\n"
#> 2 2021 "State Department" "State\… "U.S. Senator (C… "Mazie Hirono (D)\n"
#> 3 2021 "State Department" "State\… "U.S. Senator (C… " \n"
#> 4 2021 "State Department" "State\… "U.S. House Dist… "Ed Case (D)\n"
#> 5 2021 "State Department" "State\… "U.S. House Dist… "Kai Kahele (D)\n"
#> 6 2021 "Electoral Colleg… "State\… "Presidential El… "Joe Biden and Kamala Har…
gttoolkit is freely available to anyone who might find it useful. That said, the package is a catch-all for functions I’ve developed to surmount hurdles in my own workflow, so don’t expect too much support. I also can’t guarantee that it will work on every system, though I do try to keep things as system-agnostic as possible. Contributions are appreciated if you think you can improve something.