knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  #out.width = "100%",
  fig.pos = 'c'
  #fig.align = "center"
)
library('lubridate')
library('tidyverse')
library('kableExtra')
library('SvenR')

UPDATE 2023-06-29: I haven't seriously used this package in a few years and my skills have improved dramatically since. There's some good stuff in here but it probably needs to be overhauled if it's to be of use to anyone else. I'm not sure whether I'll refactor/trim this down or just leave it as a relic of the past.

I like to pretend I'm a software developer, so I created this little package. It's not a super fancy or cohesive group of functions, but I do use them (mostly interactively) in the course of my work. In any case, creating a package is fun, so maybe you will enjoy it too. I'll show some examples of what the code can do and the ideas behind it.

Installation

Get it from github:

devtools::install_github('svenhalvorson/svenr')
library('SvenR')

I've tested some of the functions quite a bit and others less so. If you find any horrendous errors, please let me know at svenpubmail@gmail.com.

Formatting numbers

If I ever get the infamous interview question 'what is your biggest weakness?', I have an answer prepped: low tolerance for tedium. Manually punching lots of numbers into a report is not only unappealing, it also opens the door to unreproducible research. As a result, I created a function to round confidence intervals and one to make nice IQRs. I also have one for rounding cohort tables (produced by the tableone package) that I might clean up and post later.

The first function, format_ci, tries to make your confidence intervals cleaner and easier to read. It takes three vectors (point estimate and lower/upper bounds) and starts an iterative rounding process. Here's an example with the nycflights13 data and broom:

# Make a binary variable indicating whether the plane left on time:
flights2 = nycflights13::flights %>% 
  mutate(
    on_time = as.numeric(dep_delay <= 0)
  )

# now make a logistic model of this with a few factors
on_time_model = glm(
  on_time ~ dep_time + origin + distance,
  family = binomial, 
  data = flights2
)

# And here's the table of unformatted confidence intervals:
on_time_est = on_time_model %>% 
  broom::tidy(
    conf.int = TRUE,
    exponentiate = TRUE
  ) %>% 
  select(
    term,
    estimate,
    conf.low,
    conf.high
  )
on_time_est %>%
  kable(
    align = 'c',
    format = 'markdown'
  )
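
Before formatting anything, it's worth seeing what plain rounding can do to a tight interval around the null (made-up bounds, not from this model):

# Both bounds collapse onto the null value of 1, hiding which
# side of 1 the interval actually sits on
round(c(0.9984, 1.0021), digits = 2)
#> [1] 1 1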

There are a few problems you can run into when rounding confidence intervals. One is that there is usually a null value, such as 1 in the case of an odds ratio, and you don't want any of the displayed numbers to be rounded onto it. Another is that you'd like three distinct numbers displayed so that the width of the interval, and whether it contains the null value, is still visible. The function can be given a null value and other unacceptable values which will not be allowed as endpoints of the CI:

on_time_est_fmt = with(
  on_time_est,
  format_ci(
    point = estimate,
    lower = conf.low,
    upper = conf.high,
    null_value = 1
  )
)
on_time_est_fmt %>% 
  kable(
    align = 'c',
    format = 'markdown'
  )

You can tell it how many digits to round to, but it will guess if none are supplied. It also has a maximum number of iterations (which can be specified) to try before giving up, so keep that in mind.

I added the function format_p to deal with p-values in the same spirit:

p_vals = c(0.05033, 0.7894, 0.999964, 0.00007)

format_p(
  p = p_vals,
  min_digits = 3,
  max_digits = 4,
  level = 0.05
)

Like format_ci, this forces the rounding away from potentially ambiguous values. The min_digits I asked for was three, but the first p-value goes to four digits to make clear which side of the significance level it's on. The max_digits argument controls how small a p-value can be before it is listed as 'P < 0.00....1'.

Another formatting function I wrote is pretty_iqr which makes a nice version of a median [Q1, Q3] for a vector:

pretty_iqr(
  x = flights2$dep_delay,
  digits = 1
)

Time weighted averages

Time weighted averages are a way of summarizing a numeric variable over many time points. They're often useful when the measurements occur at irregular intervals. Basically, we multiply each value by how long it persists and then divide by the total time, much like taking a Riemann sum.
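
In symbols, if the i-th value is held for a duration Δt_i:

$$\text{TWA} = \frac{\sum_i v_i \, \Delta t_i}{\sum_i \Delta t_i}$$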

Here's some example data:

start_date = ymd_hms('2019-01-01 00:00:00')
twa_ex = tibble(
  id = c(1, 1, 1, 1, 2, 2),
  val = c(4, 6, 8, 6, 1, NA),
  t = minutes(c(0, 10, 15, 45, 0, 10)) + start_date
)
kable(
  twa_ex, 
  align = 'c', 
  format = 'markdown'
)

The idea here is that we have an id variable, a value variable, and a time variable, and we want to summarize the value over time. Three methods of counting the points are supported: trapezoids and left/right endpoints.

Visually, id #1's values look like this:

id1 = twa_ex %>% 
  filter(id == 1)
id1 = id1 %>% 
  bind_rows(id1) %>% 
  bind_rows(id1) %>% 
  mutate(
    group = case_when(
      (row_number()-1) %/% 4 == 0 ~ 'Trapezoid',
      (row_number()-1) %/% 4 == 1 ~ 'Left',
      (row_number()-1) %/% 4 == 2 ~ 'Right')
  )

sticks = tribble(~t, ~val, ~group,
                 id1$t[1], 0, 'Trapezoid',
                 id1$t[1], 4, 'Trapezoid',
                 id1$t[2], 6, 'Trapezoid',
                 id1$t[3], 8, 'Trapezoid',
                 id1$t[4], 6, 'Trapezoid',
                 id1$t[4], 0, 'Trapezoid',
                 id1$t[1], 0, 'Left',
                 id1$t[1], 4, 'Left',
                 id1$t[2], 4, 'Left',
                 id1$t[2], 6, 'Left',
                 id1$t[3], 6, 'Left',
                 id1$t[3], 8, 'Left',
                 id1$t[4], 8, 'Left',
                 id1$t[4], 0, 'Left',
                 id1$t[1], 0, 'Right',
                 id1$t[1], 6, 'Right',
                 id1$t[2], 6, 'Right',
                 id1$t[2], 8, 'Right',
                 id1$t[3], 8, 'Right',
                 id1$t[3], 6, 'Right',
                 id1$t[4], 6, 'Right',
                 id1$t[4], 0, 'Right')

ggplot() +
  geom_polygon(data = sticks, mapping = aes(x = t, y = val), fill = 'goldenrod1') +
  geom_point(data = id1, mapping = aes(x = t, y = val), size = 2) +
  #theme_minimal() +
  scale_y_continuous(limits = c(0, 10), breaks = 0:5*2) + 
  facet_grid(.~group) +
  theme(axis.text.x = element_text(angle = 90))

The time weighted average is the area in yellow divided by the total time (45 min). The methods will produce similar results if the number of data points is large but they can be different in a small data set like this.

The time weighted average using left endpoints is this:

$$\frac{4 \cdot 10 + 6 \cdot 5 + 8 \cdot 30}{45} = 6.89$$
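
Here's that arithmetic checked by hand in base R, along with the right-endpoint and trapezoid versions, to show how much the methods can differ on a small data set (just the calculation, nothing from the package):

# Values and times (in minutes) for id 1
vals = c(4, 6, 8, 6)
mins = c(0, 10, 15, 45)
dt   = diff(mins)

c(
  left      = sum(head(vals, -1) * dt) / sum(dt),
  right     = sum(tail(vals, -1) * dt) / sum(dt),
  trapezoid = sum((head(vals, -1) + tail(vals, -1)) / 2 * dt) / sum(dt)
)
#>      left     right trapezoid 
#>  6.888889  6.222222  6.555556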

Using the function:

twa(
  df = twa_ex, 
  value_var = val, 
  time_var = t, 
  id, 
  method = 'left'
)

You must supply the data frame, identify the time and value variables, list any id variables, and choose the method. The function computes the time weighted average for each combination of the ids and also reports the total time used, the largest/smallest intervals (gaps), the number of measures received, the number utilized, and the number missing.

Some notes:

I also allowed for computing this summary statistic relative to a reference value. The four ref_dir modes are as follows:

Here's an example of computing the time weighted average above 5:

twa(
  df = twa_ex, 
  value_var = val, 
  time_var = t, 
  id, 
  ref = 5, 
  ref_dir = 'above', 
  method = 'left'
)
sticks2 = tribble(~t, ~val,
                 id1$t[2], 5,
                 id1$t[2], 6,
                 id1$t[3], 6,
                 id1$t[3], 8,
                 id1$t[4], 8,
                 id1$t[4], 5)
ggplot() +
  geom_polygon(data = sticks2, mapping = aes(x = t, y = val), fill = 'goldenrod1') +
  geom_point(data = id1, mapping = aes(x = t, y = val), size = 2) +
  scale_y_continuous(limits = c(0, 10), breaks = 0:5*2) + 
  theme(axis.text.x = element_text(angle = 90)) +
  geom_hline(yintercept = 5, linetype = 'dashed')

This can be useful if you have a benchmark value you're trying to compare to. Note that it uses the entire 45 minutes as the denominator even though the first reading contributes nothing because 4 is below the reference of 5.
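
Assuming 'above' means the area between the curve and the reference line, which is what the plot shows, the arithmetic for id #1 with left endpoints works out like this:

# Left-endpoint values and interval lengths (minutes) for id 1;
# anything at or below the reference of 5 contributes nothing,
# but all 45 minutes stay in the denominator
vals = c(4, 6, 8)
dt   = c(10, 5, 30)
sum(pmax(vals - 5, 0) * dt) / sum(dt)
#> [1] 2.111111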

Merging time intervals

Another problem I have encountered more than a few times is when records of events are broken up across multiple observations that each capture a time interval. An example of this is records of a patient's stay in the ICU. Every time they move rooms, the attending caregiver may enter a new record of when the patient came in and out. Often these times are not recorded very accurately, so we may have gaps between the in/out times, overlaps between records, and records that fall entirely within the timespan of another.

For many research questions, these changes don't matter. We just want to know when they first entered and finally left. To address this I wrote a function, merge_periods, to join time intervals together. Here is some data I simulated to demonstrate this:

data(periods_data)
periods_data

Here we can see that there is an 8 minute gap between the first and second observations. The third and fourth overlap quite a bit. Here is how we can call merge_periods to clean the data:

merged_periods = merge_periods(
  df = periods_data,
  start_time = ts_start,
  end_time = ts_end,
  id,
  tolerance = 3,
  units = 'minutes'
)

The required arguments are the data frame (df) and the column names of the start and end times. After that, any number of columns to group by can be supplied (in this case just id). There are also optional named arguments such as tolerance, which allows gaps of up to that many units to be merged. With this call the first and second records will not be merged since they are 8 minutes apart, but the second and third will. Here is some of the output:

merged_periods

In the revised data set, we have fewer rows as several of the records have been deleted or combined. Here the first row stays the same since it is more than three minutes from the next but rows 2, 3, and 4 have all been combined.
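
For intuition, the underlying idea is the classic 'merge overlapping intervals' pass: sort by start time within each id, start a new group whenever a record begins more than the tolerance after the latest end time seen so far, then collapse each group. Here's a rough dplyr sketch of that logic (not the package's actual implementation, and assuming POSIXct start/end columns):

periods_data %>% 
  dplyr::group_by(id) %>% 
  dplyr::arrange(ts_start, .by_group = TRUE) %>% 
  dplyr::mutate(
    # latest end time seen among the previous records (in seconds)
    prev_max_end = dplyr::lag(cummax(as.numeric(ts_end))),
    # start a new group when the gap exceeds the 3 minute tolerance
    new_group = is.na(prev_max_end) |
      as.numeric(ts_start) > prev_max_end + 3 * 60,
    grp = cumsum(new_group)
  ) %>% 
  dplyr::group_by(id, grp) %>% 
  dplyr::summarise(
    ts_start = min(ts_start),
    ts_end   = max(ts_end),
    .groups  = 'drop'
  )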

Another feature is that you can optionally return a nested data frame containing the original and merged data, along with optional plots:

merged_periods2 = merge_periods(
  df = periods_data,
  start_time = ts_start,
  end_time = ts_end,
  id,
  tolerance = 3,
  units = 'minutes',
  simplify = FALSE,
  plots = TRUE
)

When simplify = FALSE the result is a nested tibble that contains the original and merged data:

merged_periods2

If plots = TRUE then we have another column of ggplots showing how the intervals were combined:

merged_periods2$plot[[1]]

This can be used to check whether or not the function is working correctly, look for patterns in the periods across cases, and pick a suitable tolerance for merging.

Checking IDs

I often use this to make sure my merges are doing what I expect. The next function can either check whether a combination of columns uniquely specifies the observations or try to find such a combination. Do cyl and mpg uniquely specify the cars in mtcars?

check_id(mtcars, cyl, mpg)
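
Under the hood, the question is just whether there are as many distinct combinations of those columns as there are rows. A quick way to check the same thing by hand (not necessarily how check_id does it):

# TRUE would mean cyl and mpg form a unique key for mtcars
nrow(dplyr::distinct(mtcars, cyl, mpg)) == nrow(mtcars)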

You can also use it to try and search for a unique combination of variables by only supplying the data frame:

check_id(mtcars)

The function searches single columns first, then pairs of columns, and so on up to a number of columns equal to max_depth before giving up. In this case, any of those 9 pairs of variables uniquely specifies the observations. If no unique keys are found, the closest combination(s) are listed:

check_id(mtcars, max_depth = 1)
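
That search over combinations is the sort of thing you could sketch with combn(), checking each set of columns for uniqueness (an illustration of the idea rather than the actual internals):

# All pairs of mtcars columns that uniquely identify the rows
pairs = combn(names(mtcars), 2, simplify = FALSE)
Filter(
  function(cols) !any(duplicated(mtcars[cols])),
  pairs
)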

Lastly, here's a variation on duplicated called dupes that I find much more useful for investigating duplicates. It flags every observation that has at least one other duplicate:

mtcars %>% 
  dplyr::mutate(drat_dupe = dupes(drat)) %>% 
  dplyr::arrange(drat) %>% 
  dplyr::select(drat, drat_dupe)
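
For comparison, the same kind of flag can be built in base R by running duplicated() in both directions:

# Flag every value of drat that appears more than once, not just
# the second and later occurrences
duplicated(mtcars$drat) | duplicated(mtcars$drat, fromLast = TRUE)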

Most of the time when investigating observations with duplicated keys, I want to see the other values that are not duplicated to try to differentiate the observations. This was inspired by the Stata command 'duplicates tag', which makes it easier to look at observations with the same IDs.

Missing data

I have a couple of functions that I wrote to help identify missing data. First off, I just kept writing sum(is.na(x)) so here it is:

sum_na(x = c(NA, NA, 3, 4, NA, NA))

I also wrote a summary function, col_miss, for a data set. It computes the percent of observations that are missing for each column:

dat1 = tibble::tibble(
  x = c(NA, NA, 3, 4, NA, NA),
  y = c(NA, 'a', 'b', 'c', '', '')
)

col_miss(dat1)

You can tell it to consider empty strings as missing:

col_miss(dat1, empty_string = TRUE)
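
For comparison, the same kind of summary can be pieced together with dplyr, treating empty strings as missing like the second call above (just the calculation, not the package's code):

# Percent of values per column that are NA or an empty string
dat1 %>% 
  dplyr::summarise(
    dplyr::across(
      dplyr::everything(),
      ~ 100 * mean(is.na(.x) | .x == '')
    )
  )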

I'm not sure about you, but at my old job I always received Excel sheets with vertically merged cells. When you load these up, they have a bunch of blank entries that should be repetitions. Here's a function that can deal with that:

NM = c(NA, 'Ruidoso', NA, '', NA, 'Corona', NA, 'Roswell')
fill_down(NM)
fill_down(NM, empty_string = TRUE)
fill_down(NM, reverse = TRUE)

I later found out that the function tidyr::fill does almost the same thing. fill_down does two things differently though:

For these reasons I've kept it around but it's not necessary most of the time.
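
For reference, here's roughly how the tidyr version handles the same vector; it works on data frame columns rather than bare vectors, and empty strings need converting to NA first:

# Roughly the second fill_down() call above, via tidyr
tibble::tibble(city = NM) %>% 
  dplyr::mutate(city = dplyr::na_if(city, '')) %>% 
  tidyr::fill(city, .direction = 'down')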

Creating crosswalks

Frequently I encounter text data that I want to bin. This often comes about when I have handwritten records or data with very slight variations on a theme. Sometimes it's easy enough to do this with regular expressions, but I often find those unreliable if the underlying data changes. Sometimes it's nicer to just make a crosswalk manually, so I created a Shiny gadget to do exactly this.

Here's an example data set of hospital visits to different departments:

med_table = tibble::tibble(
  dept = c(
    'pulmonary',
    'pulmonary',
    'heart',
    'cardio',
    'cardio',
    'pulm'),
  visit = c(
    'surgery',
    'surg',
    'visit',
    'surgery',
    'surgery',
    'office'
  )
)
med_table

To use this tool, call the crosswalk function on a data frame and supply the columns you wish to cross from:

crosswalk(med_table, dept, visit)

This will create a table in your RStudio viewer:

The last column will always be named after the first column you supply, prefixed with 'new_'. This column is editable:

Then when you hit accept, it will generate the code that creates your crosswalk:

med_table_cross = structure(list(dept = c("pulmonary", "pulmonary", "heart", "cardio", "pulm"), visit = c("surgery", "surg", "visit", "surgery", "office"), new_dept = c("pulm_surg", "pulm_surg", "heart_visit", "heart_surg", "pulm_visit")), row.names = c(NA, -5L), class = "data.frame")

med_table_cross

Then you'll usually join it onto the original:

med_table %>% 
  dplyr::left_join(med_table_cross)

I created this using rstudioapi, so it actually replaces the crosswalk command in the RStudio source editor with the result. This way, the data.frame can be embedded directly into a script, and other users will not need SvenR or the clicking and typing that created the crosswalk.

I did create some safeguards to try to prevent people (mostly me) from deleting their own code, so it will not execute if you have multiple lines highlighted when you run crosswalk, or if the text of that line does not match crosswalk(.+).

A few more notes:

A lovely hotkey

This last piece might not seem so amazing, but it's actually one of the most useful things I've done in this library. It's just a little add-in, called pipe_next, that takes whatever is to the left of your cursor on that line, sets it equal to itself followed by a pipe, and then moves the cursor to the next line, indented. I set this to a hotkey and use it nonstop:

pipe lyfe
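
If you're curious how an add-in like this can be put together, here's a rough sketch using rstudioapi (a simplified stand-in, not the actual pipe_next source):

pipe_next_sketch = function() {
  ctx  = rstudioapi::getActiveDocumentContext()
  pos  = ctx$selection[[1]]$range$start
  row  = pos[['row']]
  col  = pos[['column']]
  # text to the left of the cursor on the current line
  left = trimws(substr(ctx$contents[[row]], 1, col - 1))
  # rewrite that text as 'x = x %>%' and drop to an indented new line
  rstudioapi::modifyRange(
    rstudioapi::document_range(
      rstudioapi::document_position(row, 1),
      rstudioapi::document_position(row, col)
    ),
    paste0(left, ' = ', left, ' %>%\n  '),
    id = ctx$id
  )
}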


