pretty_kable <- function(dt) { 
  knitr::kable(dt, format = 'markdown')
}

Introduction

calc() is the analytic workhorse of rads. It provides a standardized method for obtaining most of what we usually want to calculate: means, medians, counts, confidence intervals, standard errors, relative standard errors (RSE), numerators, denominators, the number missing, and the proportion missing. calc() can be used with record data (e.g., vital statistics, census, enrollment numbers, etc.) as well as survey data (e.g., BRFSS, ACS PUMS, etc.).

calc() provides a unified interface built on top of common R packages, including data.table, which can make it faster than implementing custom functions with base R or dplyr in some cases. This makes it a convenient and efficient option, allowing APDE staff to use consistent syntax across different data sources. While all calc() functionality could be achieved with other packages directly, the primary benefit is standardization and ease of use—even if some specialized packages might occasionally offer more efficient implementations for specific analyses.

This vignette will provide some examples to introduce the calc() function by walking through basic analyses with vital statistics (birth data) and survey data (ACS PUMS). To get the most out of this vignette, we recommend that you type each and every bit of code into R. Doing so will almost definitely help you learn the syntax much faster than just reading the vignette or copying and pasting the code.

calc() arguments

Arguments are the values that we send to a function when it is called. The standard arguments for calc() are:

1) ph.data \<- the name of the data.table/data.frame or survey object that you want to analyze.

2) what \<- a character vector of the variable(s) for which you want to calculate the metrics. E.g., what = c("uninsured")

3) where \<- think of this as a SQL WHERE clause, i.e., a filtering or subsetting of data. E.g., ages %in% c(0:17) & gender == "Female"

4) by \<- a character vector of the variables that you want to computer the what by, i.e., the cross-tab variable(s). E.g., by = c("gender") would stratify results by gender

5) metrics \<- a character vector of the metrics that you want returned. E.g., metrics = c("mean", "rse"). You can see a complete list of available metrics by typing metrics() and get detailed descriptions of what each metric means by typing ?metrics().

6) per \<- an integer, which is the denominator when rate is selected as a metric. Metrics will be multiplied by this value. E.g., per = 1000. NOTE this is just a scalar. At present this does not calculate a true rate (i.e., the relevant population denominator is not used).

7) win \<- an integer, which is the number of consecutive units of time (e.g., years, months, etc.) over which the metrics will be calculated, i.e., the 'window' for a rolling average, sum, etc. E.g. win = 5 will perform calculations over every 5 time unit window.

8) time_var \<- a character, which is the name of the time variable in the dataset. Used in combination with the win argument to generate time windowed calculations.

9) fancy_time \<- a logical (i.e., T or F) flag. If TRUE, a record of all the years going into the data is returned. If FALSE, just a simple range (where certain years within the range might not be represented in your data).

10) proportion \<- a logical (i.e., T or F) flag determining whether the metrics should be calculated as a proportion. Currently only relevant for survey data. The default is FALSE.

11) ci \<- a numeric value between 0 & 1 for the confidence level returned in the estimates. E.g., 0.95 == 95% confidence interval

12) verbose \<- a logical used to toggle on/off printed warnings

There is no need to specify all of the arguments listed above. As you can see in the following example, the calc function simply needs a specified dataset (i.e., ph.data) and at least one what variable to return an intelligible result.

library(rads)
library(data.table)
data(mtcars)
calc(ph.data = mtcars, what = c("mpg"))[]
pretty_kable(calc(ph.data = mtcars, what = c("mpg"))[])

Note: The use of [] after calc() is used to print the output to the console. Typically, you would not print the results but would save them as an object. E.g., my.est <- calc().

Example vital statistics analyses

First get the birth data (cf. get_data vignette)

birth <- get_data_birth(cols = c("chi_year", "chi_sex", "chi_race_eth8", 
                                 "preterm", "birth_weight_grams", "mother_birthplace_state"), 
                        year = c(2013:2019), 
                        kingco = T)

Mean (proportion) for a binary over multiple years

calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[]
pretty_kable(calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[])

Mean (proportion) for a binary over multiple years -- where newborn is male

calc(ph.data = birth, 
     what = c("preterm"), 
     where = chi_sex == "Male",
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[]
pretty_kable(calc(ph.data = birth, 
     what = c("preterm"), 
     where = chi_sex == "Male",
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[])

Mean (proportion) for a binary over multiple years -- where newborn is male born to a Hispanic mother

calc(ph.data = birth, 
     what = c("preterm"), 
     where = chi_sex == "Male" & chi_race_eth8 == "Hispanic",
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[]
pretty_kable(calc(ph.data = birth, 
     what = c("preterm"), 
     where = chi_sex == "Male" & chi_race_eth8 == "Hispanic",
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year")[])

Mean (proportion) for a binary over individual years

birth[, cy := chi_year]
calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year", 
     by = "cy")[]
pretty_kable(calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year", 
     by = "cy")[])

Mean (proportion) for a binary over windowed years

calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year", 
     win = 3)[]
pretty_kable(calc(ph.data = birth, 
     what = c("preterm"), 
     metrics = c("mean", "rse", "numerator", "denominator"), 
     time_var = "chi_year", 
     win = 3)[])

Mean for a continuous over windowed years

calc(ph.data = birth, 
     what = c("birth_weight_grams"), 
     metrics = c("mean", "rse"), 
     time_var = "chi_year", 
     win = 3)[]
pretty_kable(calc(ph.data = birth, 
     what = c("birth_weight_grams"), 
     metrics = c("mean", "rse"), 
     time_var = "chi_year", 
     win = 3)[])

Mean for a continuous in 2019, by gender and race/eth

calc(ph.data = birth, 
     what = c("birth_weight_grams"), 
     chi_year == 2019,
     metrics = c("mean", "rse"), 
     by = c("chi_race_eth8", "chi_sex"))[]
pretty_kable(calc(ph.data = birth, 
     what = c("birth_weight_grams"), 
     chi_year == 2019,
     metrics = c("mean", "rse"), 
     by = c("chi_race_eth8", "chi_sex"))[])

Proportion of 2017-2019 births among each race/eth group

calc(ph.data = birth, 
     what = c("chi_race_eth8"), 
     chi_year %in% 2017:2019,
     metrics = c("mean", "rse", "obs", "numerator", "denominator"))[]
pretty_kable(calc(ph.data = birth, 
     what = c("chi_race_eth8"), 
     chi_year %in% 2017:2019,
     metrics = c("mean", "rse", "obs", "numerator", "denominator"))[])

2017-2019 rate per 100k of births among each race/eth group

calc(ph.data = birth, 
     what = c("chi_race_eth8"), 
     chi_year %in% 2017:2019,
     metrics = c("obs", "numerator", "denominator", "rate"), 
     per = 100000)[]
pretty_kable(calc(ph.data = birth, 
     what = c("chi_race_eth8"), 
     chi_year %in% 2017:2019,
     metrics = c("obs", "numerator", "denominator", "rate"), 
     per = 100000)[])

Number and proportion of missing gender by year

calc(ph.data = birth, 
     what = c("chi_sex"), 
     chi_year %in% 2017:2019,
     metrics = c("obs", "missing", "missing.prop"), 
     by = "chi_year")[]
pretty_kable(calc(ph.data = birth, 
     what = c("chi_sex"), 
     chi_year %in% 2017:2019,
     metrics = c("obs", "missing", "missing.prop"), 
     by = "chi_year")[])

Example survey analyses

Survey data must be 'set' to use the proper survey design. However, survey data accessed through rads::get_data() functions have already been set as data.table/dtsurvey objects and are ready for analysis with calc().

pums <- get_data_pums(year = 2021, kingco = F, records = 'person')

Let's get on the same page re: the meaning of the columns from calc()

First, create a simple tabulation of the population of Seattle after sub-setting PUMS to King County.

test1 <- calc(ph.data = pums, 
             what = "nonseattle", 
             metrics = c('mean', 'numerator', 'denominator', 'obs', 'total'), 
             where = chi_geo_kc == "King County")
print(test1)
pretty_kable(test1)

To highlight the difference between denominator and obs, in this next example we introduce 100 missing values to our 'what' variable (nonseattle), thereby lowering the denominator by 100 but leaving the obs the same as above.

pums2 <- copy(pums)
pums2 <- pums2[nonseattle == 'Seattle', nonseattle := ifelse(rowid(nonseattle) <= 100, NA, nonseattle)]
test2 <- calc(ph.data = pums2, 
             what = "nonseattle", 
             metrics = c('mean', 'numerator', 'denominator', 'obs', 'total'), 
             where = chi_geo_kc == "King County")
print(test2)
pretty_kable(test2)

As expected, the obs value is the same as in table test1 above (r format(test2[1]$obs, big.mark = ',')), but the denominator value decreased by 100 to r format(test2[1]$denominator, big.mark = ',').

Mean (proportion) of those near poverty or disabled, by King County (vs. remainder of WA)

calc(ph.data = pums, 
     what = c("disability", "GEpov200"), 
     metrics = c("mean", "rse", "obs", "numerator", "denominator"), 
     proportion = T, 
     by = "chi_geo_kc")[]
pretty_kable( calc(ph.data = pums, 
     what = c("disability", "GEpov200"), 
     metrics = c("mean", "rse", "obs", "numerator", "denominator"), 
     proportion = T, 
     by = "chi_geo_kc")[])

Proportion in each CHNA age group, by disability status

calc(ph.data = pums, 
     what = c("age6"), 
     metrics = c("mean", "rse", "obs", "numerator", "denominator"), 
     proportion = F, 
     by = "disability")[]
pretty_kable(calc(ph.data = pums, 
     what = c("age6"), 
     metrics = c("mean", "rse", "obs", "numerator", "denominator"), 
     proportion = F, 
     by = "disability")[])

Mean & median age in King County, by disability status

calc(ph.data = pums, 
     what = c("agep"),
     chi_geo_kc == "King County",
     metrics = c("mean", "median", "rse", "obs", "numerator", "denominator"), 
     by = "disability")[]
calc(ph.data = pums, 
     what = c("agep"),
     chi_geo_kc == "King County",
     metrics = c("mean", "median", "rse", "obs", "numerator", "denominator"), 
     by = "disability")[]

Example analyses with non-standard data

In the examples above we used standard public health data (birth vital statistics and the Census Bureau's ACS survey). However, since calc() is a generalized function, you can use it with nearly any dataset, as long as it is a data.frame, a data.table, or a survey object. To demonstrate this, we will use calc() with synthetic data that we will generate below.

Create the dataset

library(data.table)
set.seed(98121) 

mydt <- data.table(
  school = as.factor(sample(c("Alpha", "Beta", "Gamma", "Delta"), 2000, replace = T)),
  grades = as.factor(sample(c("A", "B", "C", "D"), 2000, replace = T)), 
  year = sample(2016:2021, 2000, replace = T))

head(mydt)
pretty_kable(head(mydt))

We see that we created a dataset of 2000 rows with grades in four schools between with 2016 and 2021.

Calculate the proportion of A's and B's in the Alpha and Beta schools

grades.distribution <- calc(
  ph.data = mydt, 
  school %in% c("Alpha", "Beta"), 
  what = "grades", 
  by = "school", 
  time_var = "year", 
  metrics = c("numerator", "denominator", "mean"), proportion = F)

print(grades.distribution[level %in% c("A", "B")])
pretty_kable(grades.distribution[level %in% c("A", "B")])

These results show that the Alpha School has a higher proportion of A's (r round(grades.distribution[school=="Alpha" & level == "A"]$mean, 3) vs r round(grades.distribution[school=="Beta" & level == "A"]$mean, 3)) and a similar proportion of B's (r round(grades.distribution[school=="Alpha" & level == "B"]$mean, 3) vs r round(grades.distribution[school=="Beta" & level == "B"]$mean, 3)).

Wait! We forgot the survey weights!

A colleague just reminded you that you forgot to add the survey weights. Let's add weights and survey set the data.

# create weights
set.seed(98121)
mydt[, mywghts := sample(50:1300, 2000, replace = T)]

# survey set the data
# This uses the dtsurvey package
# similar logic applies for survey and srvyr package
mydt[, `_id` := NULL] # remove id to make things play nice
mysvy <-dtsurvey::dtsurvey(data.table(mydt), weight = 'mywghts')

Using the survey information, again calculate the proportion of A's and B's in the Alpha and Beta schools

grades.distribution2 <- calc(
  ph.data = mysvy, 
  school %in% c("Alpha", "Beta"), 
  what = "grades", 
  by = "school", 
  time_var = "year", 
  metrics = c("numerator", "denominator", "mean"), proportion = FALSE)

print(grades.distribution2[level %in% c("A", "B")])
pretty_kable(grades.distribution2[level %in% c("A", "B")])

You'll note that using the survey design caused small changes in the results. For example, the proportion of A's in the Alpha school changed from r round(grades.distribution[school=="Alpha" & level == "A"]$mean, 3) to r round(grades.distribution2[school=="Alpha" & level == "A"]$mean, 3).

Example analyses with resampled data/multiple imputation

There may be a few instances where a variable you want to compute some metrics using (multiple) imputed data. calc can work with these sorts of situations.

The following code creates a fake survey of 100 people and then creates 10 different iterations of the survey with differing values of v (to represent the imputed variable).

# Create some fake data
library('data.table')
base = data.table::data.table(id = 1:100, bin = sample(0:1, 100, T), psu = sample(1:3, 100, T), weight = runif(100, 0,2))
base = dtsurvey::dtsurvey(base, psu = 'psu', weight = 'weight')
midat = lapply(1:10, function(i){
  r = data.table::copy(base)[, v := sample(1:3, 100, T)]
})

To use the imputation version of calc, the ph.data argument must be an imputationList object. You can turn your list of plausible datasets into an imputationList via the mitools::imputationList function. You may need to install the mitools package for this to work.

midat = mitools::imputationList(midat)
class(midat)

Now that midat is the right type/class of object, you can pass it to the calc function as you would any other dataset. The slight exception is that the metrics argument must be explicitly specified. For fun, a version of the results from one imputation is also computed.

Results using MI combining methods

withmi = calc(ph.data =midat, what = 'bin', by = 'v', metrics = 'mean', proportion = T)
print(withmi)
pretty_kable(withmi[, .(v, variable, mean, mean_se, mean_lower, mean_upper)])

Results of one iteration

nomi = calc(ph.data = midat$imputations[[1]], what = 'bin', by = 'v')
print(nomi)
pretty_kable(nomi[, .(v, variable, mean, mean_se, mean_lower, mean_upper)])

Some additional notes:

  1. If there is no variation (or imputation) in the what or by variables, you should get the same result as doing just one iteration.

  2. The proportion argument is ignored when ph.data is an imputationList object. This is because the proportion argument governs what methods are used to create confidence intervals – and the special proportion methods are not currently implemented for imputed data estimates. As such, you may find confidence intervals that are outside the normal [0-1] constraints.

Knowing is half the battle ... but only half

You've been introduced to calc(), but you'll only become competent at using it by using it. Try it out with your favorite dataset. Play with it. Try to break it and, if you're successful, submit a GitHub issue. Enjoy!

-- r paste0('Updated ', format(Sys.time(), '%B %d, %Y'))



PHSKC-APDE/rads documentation built on April 14, 2025, 10:47 a.m.