gttoolkit is a library where I collect custom functions for personal workflows. As such, it will continue to evolve as I encounter problems that other libraries cannot solve.
Because gttoolkit is an eternal work-in-progress, there is no released version. Install the development version instead:
devtools::install_github("gershomtripp/gttoolkit")
Here’s a summary of the current tools:
- parse_table: parses an HTML table into a data frame of list columns. It can extract not only cell text (rvest::html_table already does this), but also HTML element attributes (href, style, etc.), element names (th, td, div), raw HTML, etc.
- rows_to_names: works like janitor::row_to_names, except that it allows the user to extract and concatenate headers and sub-headers, while also allowing some preprocessing of header rows via name_fn. This last feature is particularly important when using parse_table, which returns a data frame of list columns.

An issue with rvest::html_table is that it doesn’t offer any straightforward method to extract anything other than text from HTML tables, despite much useful information being stored in HTML element attributes. parse_table solves this by allowing the user to pass in a function that targets specific data. By default parse_table just extracts text:
library(gttoolkit)
library(rvest)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
url <- "https://en.wikipedia.org/wiki/Political_party_strength_in_Hawaii"
wiki_table <- read_html(url) %>%
html_element(".wikitable")
hawaii_text_df <- parse_table(wiki_table) %>%
mutate(across(.fns = unlist))
hawaii_text_df %>%
tail()
#> V1 V2 V3 V4
#> 244 2018\n David Ige (D)\n Doug Chin (D)[31] 25D\n
#> 245 2019\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 246 2020\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 247 2021\n David Ige (D)\n Josh Green (D)\n 24D, 1R\n
#> 248 Year Governor Lieutenant Governor State Senate
#> 249 Year Executive offices Executive offices State Legislature
#> V5 V6 V7
#> 244 46D, 5R\n Mazie Hirono (D)\n \n
#> 245 46D, 5R\n Mazie Hirono (D)\n \n
#> 246 46D, 5R\n Mazie Hirono (D)\n \n
#> 247 47D, 4R\n Mazie Hirono (D)\n \n
#> 248 State House U.S. Senator (Class I) U.S. Senator (Class III)
#> 249 State Legislature U.S. Congress U.S. Congress
#> V8 V9
#> 244 Colleen Hanabusa (D)\n Tulsi Gabbard (D)\n
#> 245 Ed Case (D)\n Tulsi Gabbard (D)\n
#> 246 Ed Case (D)\n Tulsi Gabbard (D)\n
#> 247 Ed Case (D)\n Kai Kahele (D)\n
#> 248 U.S. House District 1 U.S. House District 2
#> 249 U.S. Congress U.S. Congress
#> V10
#> 244 Hillary Clinton and Tim Kaine (D) N\n
#> 245 Hillary Clinton and Tim Kaine (D) N\n
#> 246 Joe Biden and Kamala Harris (D) Y\n
#> 247 Joe Biden and Kamala Harris (D) Y\n
#> 248 Presidential Elections\n
#> 249 Electoral College votes\n
For performance reasons parse_table doesn’t attempt to remove white space, which would require regular expressions. Most data needs further cleaning anyway, even after the white space has been removed, so it makes sense to defer these tasks to a step farther down the pipeline.
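For example, a deferred cleanup could be as simple as the following sketch. This is just an illustration, not part of the package; hawaii_text_clean is a hypothetical name, and the bracketed footnote markers are stripped purely for demonstration.

# A minimal cleanup sketch: remove footnote markers like "[31]" and trim
# surrounding white space in every column (all columns are character here).
hawaii_text_clean <- hawaii_text_df %>%
  mutate(across(everything(), ~ trimws(gsub("\\[\\d+\\]", "", .x))))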
It’s also worth noting that parse_table always takes an object of class “xml_nodeset” as input and returns a data frame of list columns, which is essentially a list of lists, as output. Beyond being faster, this gives the user the ability to extract anything a list can contain, including vectors of n > 1, named vectors, lists, data frames, etc.
hawaii_links_df <-
  parse_table(
    wiki_table,
    cell_fn = function(x) {
      html_elements(x, "a") %>%
        html_attr("href") %>%
        `names<-`(html_text2(html_elements(x, "a")))
    }
  )
# Taking a look at the contents of column V10, we see that the table contains
# named vectors of n > 1.
tail(hawaii_links_df$V10)
#> [[1]]
#> Hillary Clinton Tim Kaine
#> "/wiki/Hillary_Clinton" "/wiki/Tim_Kaine"
#>
#> [[2]]
#> Hillary Clinton Tim Kaine
#> "/wiki/Hillary_Clinton" "/wiki/Tim_Kaine"
#>
#> [[3]]
#> Joe Biden Kamala Harris
#> "/wiki/Joe_Biden" "/wiki/Kamala_Harris"
#>
#> [[4]]
#> Joe Biden Kamala Harris
#> "/wiki/Joe_Biden" "/wiki/Kamala_Harris"
#>
#> [[5]]
#> Presidential Elections
#> "/wiki/United_States_presidential_election"
#>
#> [[6]]
#> Electoral College votes
#> "/wiki/United_States_presidential_election"
The function row_to_names from the janitor library allows the user to turn a single row into column names. However, many tables contain headers and sub-headers, which means row_to_names often falls short. gttoolkit’s rows_to_names helps by allowing the user to aggregate multiple rows into one vector of headers:
hawaii_text_df %>%
  rows_to_names(row_nums = 1:3) %>%
  tail()
#> Year<<=>>Year<<=>>Year Executive offices<<=>>Kingdom<<=>>Monarch
#> 244 2018\n David Ige (D)\n
#> 245 2019\n David Ige (D)\n
#> 246 2020\n David Ige (D)\n
#> 247 2021\n David Ige (D)\n
#> 248 Year Governor
#> 249 Year Executive offices
#> Executive offices<<=>>Kingdom<<=>>Prime Minister
#> 244 Doug Chin (D)[31]
#> 245 Josh Green (D)\n
#> 246 Josh Green (D)\n
#> 247 Josh Green (D)\n
#> 248 Lieutenant Governor
#> 249 Executive offices
#> State Legislature<<=>>Kingdom<<=>>House of Nobles
#> 244 25D\n
#> 245 24D, 1R\n
#> 246 24D, 1R\n
#> 247 24D, 1R\n
#> 248 State Senate
#> 249 State Legislature
#> State Legislature<<=>>Kingdom<<=>>House of Rep.
#> 244 46D, 5R\n
#> 245 46D, 5R\n
#> 246 46D, 5R\n
#> 247 47D, 4R\n
#> 248 State House
#> 249 State Legislature
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Mazie Hirono (D)\n
#> 245 Mazie Hirono (D)\n
#> 246 Mazie Hirono (D)\n
#> 247 Mazie Hirono (D)\n
#> 248 U.S. Senator (Class I)
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 \n
#> 245 \n
#> 246 \n
#> 247 \n
#> 248 U.S. Senator (Class III)
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Colleen Hanabusa (D)\n
#> 245 Ed Case (D)\n
#> 246 Ed Case (D)\n
#> 247 Ed Case (D)\n
#> 248 U.S. House District 1
#> 249 U.S. Congress
#> State Department<<=>>Kingdom<<=>>Agent\n
#> 244 Tulsi Gabbard (D)\n
#> 245 Tulsi Gabbard (D)\n
#> 246 Tulsi Gabbard (D)\n
#> 247 Kai Kahele (D)\n
#> 248 U.S. House District 2
#> 249 U.S. Congress
#> Electoral College votes\n<<=>>Royal Elections\n<<=>>Royal Elections\n
#> 244 Hillary Clinton and Tim Kaine (D) N\n
#> 245 Hillary Clinton and Tim Kaine (D) N\n
#> 246 Joe Biden and Kamala Harris (D) Y\n
#> 247 Joe Biden and Kamala Harris (D) Y\n
#> 248 Presidential Elections\n
#> 249 Electoral College votes\n
The usefulness of rows_to_names becomes apparent when the HTML table in question has multiple headers and sub-headers scattered across the entire length of the table. For example, the table of Political party strength in Hawaii has numerous divisions within a single table. To make sense of the data we should split the table, then recombine it once we’ve reorganized the parts.
But first, the very first row should become part of the headers of every component part. A simple call to rows_to_names with its defaults will take care of this, although we should also make the names unique so that the data frame plays nicely with dplyr functions:
library(purrr)
hawaii_text_df <-
  hawaii_text_df %>%
  rows_to_names() %>%
  set_names(make.unique(names(.), "...")) # Make names unique
hawaii_text_df %>%
  head()
#> Year Executive offices Executive offices...1 State Legislature
#> 2 Year Kingdom Kingdom Kingdom
#> 3 Year Monarch Prime Minister House of Nobles
#> 4 1810\n Kamehameha (NP)\n no such office\n no such offices\n
#> 5 1811\n Kamehameha (NP)\n no such office\n no such offices\n
#> 6 1812\n Kamehameha (NP)\n no such office\n no such offices\n
#> 7 1813\n Kamehameha (NP)\n no such office\n no such offices\n
#> State Legislature...1 State Department State Department...1
#> 2 Kingdom Kingdom Kingdom
#> 3 House of Rep. Agent\n Agent\n
#> 4 no such offices\n no such office\n no such office\n
#> 5 no such offices\n no such office\n no such office\n
#> 6 no such offices\n no such office\n no such office\n
#> 7 no such offices\n no such office\n no such office\n
#> State Department...2 State Department...3 Electoral College votes\n
#> 2 Kingdom Kingdom Royal Elections\n
#> 3 Agent\n Agent\n Royal Elections\n
#> 4 no such office\n no such office\n \n
#> 5 no such office\n no such office\n \n
#> 6 no such office\n no such office\n \n
#> 7 no such office\n no such office\n \n
We can use parse_table
again to extract HTML element names, which can
tell us where the headers are so we can split the data frame
appropriately:
name_props <- as.matrix(parse_table(wiki_table, html_name)) %>%
  {rowMeans(. == "th", na.rm = TRUE)}
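To get a feel for where the header rows sit, we can peek at the proportions directly. This inspection is just for orientation and isn’t required by the rest of the pipeline:

# Rows made up (almost) entirely of "th" cells are header rows.
which(name_props > .9)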
Using name_props, we can create groups based on the location of the first sub-header row of each sub-header group. This is necessary because the sub-headers themselves have sub-headers; if we don’t anchor each group at its first sub-header row, we will end up with empty data frames:
hawaii_text_df_list <- hawaii_text_df %>%
  mutate(
    th = name_props[-1] > .9, # [-1] because the first row is now the names.
    # The first header of each group is the "th" row without a "th" before it.
    group = map2_lgl(th, lag(th), ~ isTRUE(.x) && !isTRUE(.y)) %>%
      cumsum()
  ) %>%
  group_split(group)
We can now iterate over our list of data frames and complete our headers by merging the remaining “th” rows with our current headers. The last element of the list is empty because of headers at the bottom of the original table, so we can drop it before iterating over the rest. While we’re at it, we can also rename the column containing years and convert its values to integers:
hawaii_text_df_list <-
  hawaii_text_df_list[-length(hawaii_text_df_list)] %>%
  map(~ rows_to_names(., row_nums = which(.$th), merge_current = TRUE) %>%
        rename_with(~ ifelse(grepl("Year", ., ignore.case = TRUE), "year", .)) %>%
        mutate(year = as.integer(year)) %>%
        select(-starts_with(c("th", "group"))) # Remove our "th" data to avoid confusion.
  )
# Let's take a peek at one table in the list.
head(hawaii_text_df_list[[1]])
#> # A tibble: 6 x 10
#> year `Executive offic… `Executive office… `State Legislatu… `State Legislatu…
#> <int> <chr> <chr> <chr> <chr>
#> 1 1810 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 2 1811 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 3 1812 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 4 1813 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 5 1814 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> 6 1815 "Kamehameha (NP)… "no such office\n" "no such offices… "no such offices…
#> # … with 5 more variables: State Department<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...1<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...2<<=>>Kingdom<<=>>Agent <chr>,
#> # State Department...3<<=>>Kingdom<<=>>Agent <chr>, Electoral College votes
#> # <<=>>Royal Elections <<=>>Royal Elections <chr>
Now that we’ve aggregated our headers, it’s fairly trivial to pivot to a “long” data frame and separate out the data found in the headers into variables:
library(tidyr)
hawaii_df_long <-
  map_dfr(
    hawaii_text_df_list,
    pivot_longer,
    cols = -year,
    names_to = "branch_form_office",
    values_to = "person_party"
  ) %>%
  separate(branch_form_office, c("branch", "form", "office"), sep = "<<=>>") %>%
  mutate(branch = sub("\\.{3}\\d{1,}", "", branch)) # Remove "unique" suffixes.
tail(hawaii_df_long)
#> # A tibble: 6 x 5
#> year branch form office person_party
#> <int> <chr> <chr> <chr> <chr>
#> 1 2021 "State Legislatur… "State\… "State House" "47D, 4R\n"
#> 2 2021 "State Department" "State\… "U.S. Senator (C… "Mazie Hirono (D)\n"
#> 3 2021 "State Department" "State\… "U.S. Senator (C… " \n"
#> 4 2021 "State Department" "State\… "U.S. House Dist… "Ed Case (D)\n"
#> 5 2021 "State Department" "State\… "U.S. House Dist… "Kai Kahele (D)\n"
#> 6 2021 "Electoral Colleg… "State\… "Presidential El… "Joe Biden and Kamala Har…
gttoolkit is freely available to anyone who might find it useful. That said, the package is a catch-all for functions I’ve developed to surmount hurdles in my own workflow, so don’t expect too much support. I also can’t guarantee that it will work on every system, though I do try to keep things as system-agnostic as possible. Contributions are appreciated if you think you can improve something.