library(rads)
The comparison of crude health indicator rates across populations is often confounded by different age structures. For example, in 2019 the mean age of Hispanic King County residents was ~12.5 years younger than that of white residents (27.5 vs 40). Given this age discrepancy, we should expect higher mortality rates among white residents. However, you might be thinking, “Is there a way to compare Hispanic and white mortality rates while fully accounting for their different age structures? I.e., is there a way to standardize or adjust for age?”
While you can use Poisson regression or other advanced statistical techniques to calculate age adjusted rates, for simplicity and broader comparison with the work of others, it is often beneficial to adjust your data using a standard reference population. This process is called age standardization. Age standardization answers the question, “What rates would I observe if my data had the same age structure as a different population?”
There are 35 reference populations baked into rads
(via it's dependency on rads.data
). Their names are visible when you type list_ref_pop()
. Here are the first five as an example:
list_ref_pop()[1:5]
The first one is the default reference population used by APDE and WA DOH. Taking a peek behind the curtain shows that each reference standard is a simple table of age ranges with their corresponding populations.
get_ref_pop("2000 U.S. Std Population (11 age groups)")[, 1:4]
The age_standardize()
function provides a simple way to use reference populations to calculate age standardized rates. This vignette will provide some examples of how to use age_standardize()
and how to understand its output. To get the most out of this vignette, we recommend that you type each and every bit of code into R. Doing so will almost definitely help you learn the syntax much faster than just reading the vignette or copying and pasting the code.
age_standardize()
argumentsArguments are the values that we send to a function when it is called. The standard arguments for age_standardize()
are:
1) ph.data
<- the name of the data.table/data.frame with the data that you want to age standardize. Note: ph.data must contain a numeric column named 'age'. The sole exception is if it already has a standard population merged onto the dataset, in which case it must contain a numeric column named 'stdpop'.
2) ref.popname
<- a character vector whose only valid options are "none" (used when ph.data has a 'stdpop' column) or one of the reference standarards contained in list_ref_pop()
3) collapse
<- a logical vector (T|F), determines whether or not to to collapse the 'age' column in ph.data to match those in ref.popname. If your data is already collapsed / aggregated into the same age bins as your reference population, and if that variable is named "agecat", then you can set collapse to F. The default is T
4) my.count
<- a character vector, identifies the column name for the aggregated count data in ph.data. The default is "count"
5) my.pop
<- a character vector, identifies the column name for the population number corresponding to the count. The default is "pop"
6) per
<- an integer, used as multiplier for all rates and CI, e.g., when per = 1000, the rates are per 1000 persons. The default is 100,000
7) conf.level
<- a numeric value between 0.00 & 0.99, designates the confidence level (i.e., 1 - alpha) to be used in the calculations. The default is 0.95
8) group_by
<- a character, specifies the variable(s) by which to stratify the rate results (if any). The default is NULL (i.e., no stratification)
Your dataset must have the following three columns ...
1) 'age' or 'agecat': 'age' in single years (if collapse = T) or 'agecat' with the same age bins as your selected reference population (if collapse = F).
2) a count for the event (e.g., disease) for which you want to find an age standardized rate.
3) the population corresponding to the age or agecat in your original data
Create synthetic line level data for a cohort 51 to 60 years of age with a binary indicator for disease status
library(data.table) set.seed(98121) temp1 <- data.table( age = rep(51:60, 100), disease = sample(0:1, 1000, replace = T), pop = rep(c(seq(1000, 910, -10)), 100)) temp1[]
Run age_standardize with APDE's default reference population
age_standardize(ph.data = temp1, ref.popname = "2000 U.S. Std Population (11 age groups)", collapse = T, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95)[]
Does anything seem fishy? Think for a moment ...
count is the number of events that occured (i.e., the numerator)
pop is the population corresponding to the count (i.e., the denominator)
crude.rate, crude.lci, & crude.uci are the crude (i.e., observed) rate and CI
adj.rate, adj.lci, & adj.uci are the age standardized rate and CI
Did you find the problem?
The problem is that the population is too big. Eyeballing the output when we created the dataset above, we see that the total population for those 51 to 60 would have to be less than 10,000 (10 age groups with max 1,000 population for each). In this case it is almost 1 million! In the description of the arguments above, it specifies that we need to use aggregated count data. Now you see why. Ignoring that detail caused us to inflate the population (and therefore deflate the rate and CI) 1000x.
Aggregate (collapse) the line level data
temp1 <- temp1[, .(disease = sum(disease)), by = c("age", "pop")] temp1[]
Run age_standardize again << this time with aggregated disease events
ex1.1 <- age_standardize(ph.data = temp1, ref.popname = "2000 U.S. Std Population (11 age groups)", collapse = T, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95) ex1.1[]
Now that count, pop, and rates seem reasonable, let's see what happens if we change the reference population. We'll arbitrarily set the references population to the 36th in the list provided by list_ref_pop()
(i.e., 'World (WHO 2000-2025) Std Million (single ages to 99)').
ex1.2 <- age_standardize(ph.data = temp1, ref.popname = list_ref_pop()[36], collapse = T, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95) ex1.2[]
As we'd expect, the crude rates are identical since the reference population is irrelevant for those calculations (r ex1.1[, .(crude.rate)]
vs r ex1.2[, .(crude.rate)]
). However, the age-standardized rates changed from r ex1.1[, .(adj.rate)]
to r ex1.2[, .(adj.rate)]
. Remember, if you want to compare your age-adjusted rates to those published by other health jurisdictions, it's important that you use the same reference population.
Let's create a new dataset with disease counts aggregated by age (46 to 64) and gender (F|M)
set.seed(98121) temp2 <- data.table( gender = c(rep("F", 20), rep("M", 20)), age = rep(46:65, 2), disease = c(sample(25:46, 20, replace = T), sample(25:35, 20, replace = T)), pop = c(sample(2500:3500, 20, replace = T), sample(2200:3300, 20, replace = T))) head(temp2)
Let's examine the overall rates
ex2.1 <- age_standardize(ph.data = temp2, collapse = T, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95) ex2.1[]
In this case, the crude rate and age standardized rate are the same. This doesn't mean there is a mistake. Notice that the confidence intervals differ -- as would be expected.
Also notice that we didn't include the ref.popname
argument. When it is not specified, age_standardize()
uses the default which is list_ref_pop()[1]
(i.e, r list_ref_pop()[1]
).
Now let's run the same analysis, but stratified by gender
ex2.2 <- age_standardize(ph.data = temp2, collapse = T, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95, group_by = "gender") ex2.2[]
Here we see that the crude and age standardized rates are higher among females when compared to males.
Create a reference population for the gendered dataset above
To keep things simple, we will create a reference population based on single ages rather than age bins. As specified in the arguments description above, we will name the standard population column 'stdpop'.
set.seed(98121) new.standard <- data.table( gender = c(rep("M", 20), rep("F", 20)), age = rep(46:65, 2), stdpop = c(sample(7800:16000, 20, replace = T), sample(10000:20000, 20, replace = T))) head(new.standard)
Merge the standard population onto the data
temp3 <- merge(temp2, new.standard, by = c("age", "gender"), all = T) head(temp3)
Calculate the rates when ref.pop = "none"
Note that I need to specify collapse = F
because the function expects data to be pre-aggregated for the custom reference population contained in stdpop
.
ex3.1 <- age_standardize(ph.data = temp3, ref.popname = "none", collapse = F, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95, group_by = "gender") ex3.1[]
collapse = F
In example #3 above, we specified collapse = F
because age_standardardize()
expects the data to be pre-aggregated when you provide a stdpop
column. The other time when you will want to set collapse = F
is if you have data that has already been collapsed down to the reference population's age bins along with the proper labels in a column called agecat
. This is uncommon. It isn't worth your time and energy to manually collapse the data -- so don't do it! This functionality is here just in case you receive data that has already been structured this way.
Let's recreate the dataset used in example 2 above
set.seed(98121) temp4 <- data.table( gender = c(rep("M", 20), rep("F", 20)), age = rep(46:65, 2), disease = c(sample(25:46, 20, replace = T), sample(25:35, 20, replace = T)), pop = c(sample(2500:3500, 20, replace = T), sample(2200:3300, 20, replace = T))) head(temp4)
Collapse the data down to the same age bins as those used in the APDE standard reference population
temp4[age %in% 45:49, agecat := "45-49 years"] temp4[age %in% 50:54, agecat := "50-54 years"] temp4[age %in% 55:59, agecat := "55-59 years"] temp4[age %in% 60:64, agecat := "60-64 years"] temp4[age %in% 65:69, agecat := "65-69 years"] temp4 <- temp4[, .(pop = sum(pop), disease = sum(disease)), by = c("agecat", "gender")] temp4[]
Now you are able to run age_standardize with collapse = F
ex4.1 <- age_standardize(ph.data = temp4, collapse = F, my.count = "disease", my.pop = "pop", per = 1000, conf.level = 0.95, group_by = "gender") ex4.1[]
Note the results in ex4.1 are exactly the same as those in ex2.2. This makes sense since age_standardize()
collapsed the data in a similar (but more efficient manner) when creating ex2.2.
You're staffing data requests today and receive the following message: "Hi! For a class, I'd like to get the 2019 teen (13 to 19) birth rate for King County and WA State as a whole. Since the age distribution in the county may differ from the rest of the state, it would be appreciated if you could provide crude and age-standardized rates. If you could standardize to "World (WHO 2000-2025) Std Million (single ages to 84)", that would be great! Thank you, S. Capstone"
You quickly pull up CHAT only to find that this specific indicator doesn't exist. You remember that CHI has an adolescent birth rate indicator, but are downcast when you find that it applies to those 15 to 17. In desperation, you resign yourself to using rads
.
Get birth counts for 13 to 19 year olds in 2019
kcbirth <- get_data_birth(cols = c("chi_age", "chi_year"), year = 2019, kingco = T) wabirth <- get_data_birth(cols = c("chi_age", "chi_year"), year = 2019, kingco = F) births <- rbind(kcbirth[, geo := "King County"], wabirth[, geo := "WA State"]) births <- births[chi_age %in% 13:19] # collapse / aggregate births <- births[, .(births = .N), by = c("chi_age", "geo")] setorder(births, geo, chi_age) setnames(births, "chi_age", "age") births[]
Get the female populations corresponding to the birth counts
kcpop <- get_population(kingco = T, years = 2019, ages = 13:19, genders = "Female", group_by = "ages", geo_vintage = 2020, census_vintage = 2020) kcpop <- kcpop[, .(age, geo = geo_id, pop)] wapop <- get_population(kingco = F, years = 2019, ages = 13:19, genders = "Female", group_by = "ages", geo_type = "zip", geo_vintage = 2020, census_vintage = 2020) wapop <- wapop[, .(pop = sum(pop), geo = "WA State"), by = "age"] pop <- rbind(kcpop, wapop) pop[]
Merge population onto to birth counts
temp5 <- merge(births, pop, by = c("age", "geo"), all = T) temp5[]
Run age_standardize()
ex5.1 <- age_standardize(ph.data = temp5, ref.popname = "World (WHO 2000-2025) Std Million (single ages to 84)", collapse = T, my.count = "births", my.pop = "pop", per = 1000, conf.level = 0.95, group_by = "geo") ex5.1[]
According to your analysis, the crude King County teen birth rate is approximately half (r ex5.1[geo=="King County"]$crude.rate
/ r ex5.1[geo=="WA State"]$crude.rate
= r round2(ex5.1[geo=="King County"]$crude.rate/ex5.1[geo=="WA State"]$crude.rate, 3)
) of the WA State teen birth rate. This relationship remains largely unchanged after age-standardization (r ex5.1[geo=="King County"]$adj.rate
/ r ex5.1[geo=="WA State"]$adj.rate
= r round2(ex5.1[geo=="King County"]$adj.rate/ex5.1[geo=="WA State"]$adj.rate, 3)
).
If you've walked through this vignette and more or less understood what's going on, you're in good shape! If you're still confused, please walk through it again and then reach out if you still have questions. Good luck!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.