README.md
In ianmoran11/locatr: Easier Tidying of Data from Spreadsheets

The locatr package makes tidying data from spreadsheets easier. It helps identify and classify table cells, and then visually inspect them. .

The locatr package is not available on CRAN. It can be installed from github with the following script:

# install.packages("devtools")
devtools::install_github("ianmoran11/locatr")

The locate functions work much like unpivotr::behead. The key difference is that, rather than progressively removing headers, locate functions annotate the tidyxl data frame with .direction, .header_group and .value columns, leaving reshaping to a final function call.

Minimal example : `locate`

Here's a minimal example involving a table with two row headers and two column headers.

The first step is to locate the data cells with the locate_data function. Calling locate_data and providing an expression that filters for data cells sends these cells to an attribute named data_cells.

locatr_example("worked-examples.xlsx") %>% 
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>% 
  attr("data_cells")
#> # A tibble: 16 x 24
#>    .value .direction .header_label address   row   col data_type character
#>    <chr>  <chr>      <chr>         <chr>   <int> <int> <chr>     <chr>    
#>  1 1      <NA>       <NA>          D4          4     4 numeric   <NA>     
#>  2 2      <NA>       <NA>          E4          4     5 numeric   <NA>     
#>  3 3      <NA>       <NA>          F4          4     6 numeric   <NA>     
#>  4 0      <NA>       <NA>          G4          4     7 numeric   <NA>     
#>  5 3      <NA>       <NA>          D5          5     4 numeric   <NA>     
#>  6 4      <NA>       <NA>          E5          5     5 numeric   <NA>     
#>  7 5      <NA>       <NA>          F5          5     6 numeric   <NA>     
#>  8 1      <NA>       <NA>          G5          5     7 numeric   <NA>     
#>  9 5      <NA>       <NA>          D6          6     4 numeric   <NA>     
#> 10 6      <NA>       <NA>          E6          6     5 numeric   <NA>     
#> 11 9      <NA>       <NA>          F6          6     6 numeric   <NA>     
#> 12 2      <NA>       <NA>          G6          6     7 numeric   <NA>     
#> 13 7      <NA>       <NA>          D7          7     4 numeric   <NA>     
#> 14 8      <NA>       <NA>          E7          7     5 numeric   <NA>     
#> 15 12     <NA>       <NA>          F7          7     6 numeric   <NA>     
#> 16 3      <NA>       <NA>          G7          7     7 numeric   <NA>     
#> # … with 16 more variables: numeric <dbl>, date <dttm>, logical <lgl>,
#> #   error <chr>, is_blank <lgl>, local_format_id <int>, sheet <chr>,
#> #   character_formatted <list>, formula <chr>, is_array <lgl>,
#> #   formula_ref <chr>, formula_group <int>, comment <chr>, height <dbl>,
#> #   width <dbl>, style_format <chr>

plot_cells produces a plot that indicates which cells are now labelled as data.

locatr_example("worked-examples.xlsx") %>% 
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>% 
  plot_cells()

Once the data cells are identified, we can add header information to the tidyxl data frame (including .direction, .header_group and .value columns) using the locate function. This function requires direction and variable names. Again, plot_cells can be called to check that data cells have been identified correctly.

Once all header have directions and names, migrate reshapes the tidyxl data frame into a tidy structure.

The gif below illustrate how direction informations is progressively added to the data frame.

And below is the code used in the gif.

locatr::locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>% 
  locate(direction = "WNW", name = subject_type) %>% 
  locate(direction = "W", name = subject) %>% 
  locate(direction = "NNW", name = gender) %>% 
  locate(direction = "N", name = name) %>% 
  migrate()
#> # A tibble: 16 x 7
#>      row   col .value gender name    subject_type subject 
#>    <int> <int> <chr>  <chr>  <chr>   <chr>        <chr>   
#>  1     4     4 1      Year 1 Matilda Humanities   Classics
#>  2     4     5 2      Year 1 Paul    Humanities   Classics
#>  3     5     4 3      Year 1 Matilda Humanities   History 
#>  4     5     5 4      Year 1 Paul    Humanities   History 
#>  5     6     4 5      Year 1 Matilda Performance  Music   
#>  6     6     5 6      Year 1 Paul    Performance  Music   
#>  7     7     4 7      Year 1 Matilda Performance  Drama   
#>  8     7     5 8      Year 1 Paul    Performance  Drama   
#>  9     4     6 3      Year 2 Matilda Humanities   Classics
#> 10     4     7 0      Year 2 Paul    Humanities   Classics
#> 11     5     6 5      Year 2 Matilda Humanities   History 
#> 12     5     7 1      Year 2 Paul    Humanities   History 
#> 13     6     6 9      Year 2 Matilda Performance  Music   
#> 14     6     7 2      Year 2 Paul    Performance  Music   
#> 15     7     6 12     Year 2 Matilda Performance  Drama   
#> 16     7     7 3      Year 2 Paul    Performance  Drama

Conditional headers : `locate_if`

Sometimes not all headers in the same column or row belong to the same group. For example, in the table below, the row headers in column B represent a mix of subject type and subject name.

To deal with this we create a variable that represents the indenting of cells, and then use locate_if to selectively associate cells with directions and header groups.

locatr_example("worked-examples.xlsx") %>% 
  xlsx_cells_fmt(sheets = "pivot-indent") %>%
  append_fmt(fmt_alignment_indent) %>%
  locate_data(data_type == "numeric") %>%
  locate_if(fmt_alignment_indent == 0, direction = "WNW", name = subject_type) %>% 
  locate_if(fmt_alignment_indent == 1, direction = "W", name = subject) %>% 
  locate(direction = "N", name = student) %>% 
  migrate()
#> # A tibble: 8 x 6
#>     row   col .value student subject_type subject 
#>   <int> <int> <chr>  <chr>   <chr>        <chr>   
#> 1     4     3 1      Matilda Humanities   Classics
#> 2     4     4 2      Paul    Humanities   Classics
#> 3     5     3 3      Matilda Humanities   History 
#> 4     5     4 4      Paul    Humanities   History 
#> 5     7     3 5      Matilda Performance  Music   
#> 6     7     4 6      Paul    Performance  Music   
#> 7     8     3 7      Matilda Performance  Drama   
#> 8     8     4 8      Paul    Performance  Drama

A more concise syntax : `locate_groups`

We can deal with multiple headers differentiated by formatting more concisely using locate_groups. The .grouping argument allows us to indicate which formats differentiate headers. In this case, hierarchy is indicated by indenting, which can be accessed with the fmt_alignment_indent function. The .hook_if argument receives an expression with hook that indicates which header_groups are "WNW" rather than "N". The .hook_if_rev argument will switch directions from "N" to "WSW". Importantly the hook expression is passed into summarise so it needs to reduce columns to a single boolean value. This is the reason for using any in the example below.

locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets =  "pivot-indent") %>%
  append_fmt(fmt_alignment_indent) %>% 
  locate_data(data_type == "numeric") %>%
  locate_groups(direction = "W",
                .groupings = groupings(fmt_alignment_indent),
                .hook_if =     hook_if(any(fmt_alignment_indent == 0))) %>%
  locate(direction = "N", name = student) %>% 
  plot_cells()

To browse different aspects of formatting on which to separate headers, type fmt_ and tab

Here's a more complicate table.

We can tidy this table by:

filtering to include only the upper table (filtering out any cells below the first containing "RSE")
locating the data, preventing the inclusion of the cells containing 2018 and 2019 in column 1
differentiating row groups based on whether they are numeric cells
identifying column headers, using the header_fill argument to deal with merged cells.

annotated_df <- 
  locatr_example("newresconst.xlsx") %>% 
  xlsx_cells_fmt(sheets = "Table 1 - Permits") %>%
  append_fmt(fmt_font_bold) %>% 
  filter(row < min(row[str_detect(character,"RSE")],na.rm = TRUE)) %>% 
  locate_data(data_type == "numeric" & col > 1) %>%
  locate_groups(direction = "W", 
                .groupings = groupings(is.na(numeric)), 
                .hook_if = hook_if(any(data_type == "numeric"))) %>% 
  locate_groups(direction = "N", header_fill = "style")  

annotated_df %>% plot_cells()


annotated_df %>% migrate()
#> # A tibble: 156 x 7
#>      row   col .value N_header_label_… N_header_label_… W_header_label_…
#>    <int> <int> <chr>  <chr>            <chr>            <chr>           
#>  1     9     2 1377   United States    "Total"          2018            
#>  2     9     3 851    United States    "1 unit"         2018            
#>  3     9     4 40     United States    "2 to 4 units"   2018            
#>  4     9     5 486    United States    "5 units\r\n or… 2018            
#>  5     9     6 135    Northeast        "Total"          2018            
#>  6     9     7 51     Northeast        "1 unit"         2018            
#>  7     9     8 203    Midwest          "Total"          2018            
#>  8     9     9 119    Midwest          "1 unit"         2018            
#>  9     9    10 652    South            "Total"          2018            
#> 10     9    11 456    South            "1 unit"         2018            
#> # … with 146 more rows, and 1 more variable: W_header_label_02 <chr>

Note that older versions of dplyr require substituting filter for filter_fmt.

ianmoran11/locatr documentation built on June 14, 2020, 3:32 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com