In katiecoburn/generalizeR: What the Package Does (One Line, Title Case)

# Note:
#  changing back to this (below)
#   rmarkdown::html_vignette
# would probably make the vignette smaller.

knitr::opts_chunk$set(echo = TRUE)
library(crayon)

Introduction

Note that it is possible to use the package at either end of the sampling process; that is, you can start at the beginning and use the package to obtain a stratified sample, or you can start at the end with your previously-obtained sample data and assess its generalizability.

Setup

# install.packages('devtools')
library(devtools)
library(tidyverse)
# install_github('katiecoburn/generalizeRdata')
# install_github('katiecoburn/generalizeR')
library(generalizeRdata)
library(generalizeR)
library(ggthemes)
library(gridExtra)
# install.packages("usmap")
library(usmap)

Soon the generalizeR package will be on CRAN and can be installed with 'install.packages'. The supplemental package, generalizeRdata, however, will always be on GitHub since the data files it contains are quite large and incapable of meeting CRAN's size constraints.

Step 1: Stratify

The stratify() function defaults to 'guided' mode, meaning that it prints text throughout and prompts the user to enter information with a question-and-answer format. This is intentional, in part to mimic the nature of http://thegeneralizer.org/ and in part because we intend for this R package to be as accessible as possible for people with limited R experience.

Users who are running stratify() multiple times, who are familiar with R, or who simply dislike the guided feature can turn it off by using the argument 'guided = FALSE'. If they do so, however, they must be sure to specify values for all the other function arguments, as those would otherwise be obtained from user input later.

This tutorial will follow a hypothetical example.

Example

The first is of an education researcher who wants to test the effect of an intervention on SAT scores. The researcher has somewhat limited resources, so they plan on a sample size of 40 schools. They want to estimate the average effect of their intervention in Texas charter high schools. Thus, their inference population consists of all Texas charter high schools. Previous literature suggests that gender, minority status, and social class might affect gaps in achievement, so the researcher selects a few stratifying variables -- percentage female, percentage black, and percentage free and reduced lunch. The researcher also thinks school size might result in treatment differences, so they include the total school size as well.

Since this researcher is working with high schools, they can use the Common Core database we provide in 'generalizeRdata'. For information on each of the columns and the source of the data, run '?cc'.

cc

By looking at the 'st' column, we can already see a problem here. The data frame contains schools from all states in the US, but the researcher has a much narrower inference population. Therefore, there is some filtering they must do first.

To identify Texas schools, the researcher can select those rows with 'st' equal to 'TX':

inference_pop <- cc %>% 
  filter(st == "TX")

There is a variable in the data frame called 'charter'; '?cc' will show that it takes on the value of 0 for non-charter schools and 1 for charter schools. Finally, to select high schools (which almost always include grades 9 to 12), the researcher can use one of the grade indicator variables, say 'g_10_offered'. After combining all of these filters, we get:

inference_pop <- cc %>% 
  filter(st == "TX") %>% 
  filter(charter == 1) %>% 
  filter(g_10_offered == "Yes")

inference_pop

The size of our inference population has dropped from over 98,000 schools to about 350 schools. It now only includes Texas charter high schools (or schools that offer 10th grade).

The researcher can then run:

output <- stratify(data = inference_pop)

They are greeted with the following:

cat(bold("If you want to store your results, make sure you assign \nthis function to an object.\n\n"))

cat("Your chosen inference population is the '", deparse(substitute(inference_pop)), "' dataset.", sep = "")
cat("\n")
cat("\n")

idnum <- readline(prompt = "Enter the name of the ID Variable in your dataset: ")

It's important to note that the user can hit the 'Escape' key here to stop the function if they have forgotten to store their results -- that is, to use the assignment operator '<-' to set their results to an object (here, called 'output').

At the prompt, they enter the column name that contains the NCES school IDs in the CCD database -- "ncessch":

cat("If you want to adjust or restrict your inference population \n(e.g., if you are interested in only one location, etc.), \nmake sure that you have altered the data frame appropriately. \nIf you need to alter your data frame, you can exit this \nfunction, use dplyr::filter(), and \nreturn.\n")

The researcher has already done this, so they select 'yes'.

idnum <- "ncessch"
data <- inference_pop
id <- data %>% select(all_of(idnum))
data <- data %>% select(-all_of(idnum))

cat("\nYou're now ready to select your stratification variables. \nThe following are the variables available in your dataset.")

The researcher now selects the variables that represent their stratifying variables of interest -- 'pct_female', 'pct_black_or_african_american', 'pct_free_and_reduced_lunch', and 'total'. They enter: '29 32 38 42'.

The function prints out a list of the variables chosen with names highlighted and a table of variables and their types. This is so the researcher can look them over, confirm what they selected, and make sure the variables are the expected types. Since these results seem reasonable, the researcher chooses 'Yes'.

Next, the function prints out the descriptive statistics of the stratification variables chosen, categorical and continuous. In this case, all the variables are continuous. Plots of the variables -- histograms or bar charts, as appropriate -- are generated and displayed one at a time.

par(mar = c(4, 4, .1, .1))
plot1 <- inference_pop %>% 
  ggplot(aes(x = total)) + geom_histogram(bins = 30) +
  theme_base() + 
  xlab("total") +
  labs(title = "Histogram of total")

plot2 <- inference_pop %>% 
  ggplot(aes(x = pct_black_or_african_american)) + geom_histogram(bins = 30) +
  theme_base() + 
  xlab("pct_black") +
  labs(title = "Histogram of pct_black")

grid.arrange(plot1, plot2, ncol = 2)

par(mar = c(4, 4, .1, .1))
plot1 <- inference_pop %>% 
  ggplot(aes(x = pct_female)) + geom_histogram(bins = 30) +
  theme_base() + 
  xlab("pct_female") +
  labs(title = "Histogram of pct_female")

plot2 <- inference_pop %>% 
  ggplot(aes(x = pct_free_and_reduced_lunch)) + geom_histogram(bins = 30) +
  theme_base() + 
  xlab("pct_frlunch") +
  labs(title = "Histogram of pct_frlunch")

grid.arrange(plot1, plot2, ncol = 2)

Now the researcher is ready to choose a number of strata. The function prints some information, similar to that presented by the Generalizer Web application, explaining what the choice represents and giving users some guidance. While choosing more strata is generally better, the practical demand placed on the sampling process increases with the number of strata, because units must be sampled from each stratum.

Since the researcher in our example plans on a sample size of 40, they try a smaller number -- 4 strata:

The process of stratifying can take some time. If the function were to run silently, users might fear that R had frozen and quit. As a sort of progress bar, we have turned on "verbose" mode for the clustering process, which prints out the within-cluster sum of squares at each iteration until the stratifying process converges:

The results begin by telling the user the percentage of population variation explained by the strata -- in this case, about 66%. Increasing the number of strata would increase the percentage explained and result in more homogeneous strata, but would require more resources.

The function also provides a table of the within-cluster means and standard deviations for each of the stratifying variables, and a count of the total number of units in each stratum.

test <- (stratify(inference_pop, guided = FALSE, n_strata = 4, variables = c("total", "pct_black_or_african_american", "pct_female", "pct_free_and_reduced_lunch"), idnum = "ncessch"))

It prints a heat map:

test$heat_plot_final

Each column of the heat map corresponds to a stratum. Users can read the map by going down the columns, assessing each stratum in relation to the population. Since the goal is for the strata to be homogeneous, to sample across them and gain a sample representative of the population, each stratum will differ from the others. We'll "read" a stratum here as an example:

The first stratum contains 161 Texas charter high schools which are close to the population mean in number of students (about 430), above the population mean in the percentage of students on free or reduced lunch (80%), close to the population mean in percentage of female students (50%), and below the population mean in percentage of black students (10%).

(Users may wonder why the shade of red for the bottom left and bottom right cells differs when both seem to have a mean of 10%. This is because of the scale of the variable, the size of the population mean, and rounding.)

Finally, users are given the option to change the number of strata. The researcher says 'no'; they are satisfied with four.

They have completed the stratifying process! For their study, the next step is figuring out which units from these strata to recruit. They can proceed to the second function, 'recruit().'

Step 2: Recruit

In each of these examples, our researchers have successfully stratified their inference populations into 4 and 8 clusters, respectively. Their goal is now to sample units from each cluster in such a way that their overall sample will be representative of their entire inference population. The 'recruit()' function is their next step.

Like the 'stratify()' function, 'recruit()' is guided by default; users can simply set 'guided = FALSE' to turn this option off if they choose, but (again) they must specify values for the other function arguments instead.

Since the results of 'stratify()' were saved to an object, that object can be read into 'recruit()', which automatically learns information about the stratification process.

Example

recruit(output)

Step 3: Assess

Example

The researcher in this example has taken their four recruitment lists and successfully recruited the desired number of units from each stratum: 20, 4, 6, and 11 units from strata 1, 2, 3, and 4, respectively. They have completed their study and are interested in whether their results are generalizable to other states in the US.

Recall that their inference population consisted of Texas charter high schools. Therefore, they most likely want to know whether their sample can generalize to other charter high schools in the US.

First the researcher specifies the variables that they hope are generalizable:

selection_vars <- c("pct_female", "pct_black_or_african_american", "pct_free_and_reduced_lunch", "total")

Then they read in their sample. The way they do this may differ -- for example, they might have their four separate recruitment lists, or one large file, or their original stratify() object. The important thing is to create a data frame with a column that consists of the sample IDs.

output <- test

sample <- tibble(ncessch = c(output$recruitment_lists[[1]]$ncessch[1:20],
                output$recruitment_lists[[2]]$ncessch[1:4],
                output$recruitment_lists[[3]]$ncessch[1:6],
                output$recruitment_lists[[4]]$ncessch[1:11]))

They specify their inference population again -- this time a data frame including all states with and without charter high schools. The data frame should include unit IDs, variables of interest, and (if applicable) a grouping variable; since the researcher wants to generalize to US states, the grouping variable is 'st' (state).

inference_pop <- cc %>% 
  filter(charter == 1) %>%
  filter(g_10_offered == "Yes") %>%
  select(ncessch, all_of(selection_vars), st)

Finally, they feed their sample and populatiuon data frames, the name of the ID variable, and the name of the grouping variable to 'assess_wrap()' -- a wrapper of the included 'assess()' function for ease of use.

gen_results <- assess_wrap(sample = sample, population = inference_pop,
            join_var = "ncessch", grouping_var = "st")

They can view a table of their results across states. This code extracts the 'st' variable as a factor and retains unique values, one row per state:

state <- inference_pop %>% 
  na.omit() %>% 
  select(st) %>% 
  transmute(state = factor(st)) %>% 
  unique()
g_indexes <- gen_results[[2]]

Then that information is bound together with the generalizability indexes:

g_overview <- tibble(state, g_index = g_indexes)
g_overview

However, the information is easier to assess in the form of a map. There are many mapping tools in R. First, we use the tidyverse to turn the generalizability index values into a categorical variable representing the three levels of generalizability.

g_overview <- g_overview %>% mutate(g_index = case_when(
  g_index >= 0.90 ~ "best",
  (g_index < 0.90 & g_index >= 0.50) ~ "okay",
  g_index < 0.50 ~ "worst",
  is.na(g_index) ~ "worst"
))

A G-index above 0.90 means that the sample is equivalent to a randomized trial. A G-index between 0.50 and 0.90 means that the sample is not quite a miniature of the population, but statistical adjustments to reweight the sample may make generalization possible. Finally, a G-index below 0.50 means that generalization (based upon the selected covariates) is completely unwarranted, and statistical adjustment cannot help.

plot_usmap(data = g_overview, values = "g_index", labels = TRUE) +
   labs(title = "US States",
       subtitle = "Generalizability Index") +
  theme(panel.background = element_rect(color = "black"), 
        legend.position = "right") + 
  scale_fill_grey()

This map tells the researcher a few things.

Their results are equivalent to a randomized controlled trial for generalizations to charter high schools in California and Texas. Texas is to be expected here, since their entire sample came from Texas. California most likely has a charter high school population that is very similar to Texas in terms of the stratifying variables (pct_female, pct_black_or_african_american, pct_free_and_reduced_lunch, and total number of students).

With statistical adjustments, they could make generalizations to the populations of charter high schools in: Washington, Nevada, Utah, Arizona, Colorado, Kansas, Oklahoma, Minnesota, Indiana, Wisconsin, Michigan, Ohio, Pennsylvania, New York, Connecticut, North and South Carolina, Georgia, Florida, and Arkansas.

Generalizations to the remaining states would be completely unwarranted; in some cases this is because there are no charter high schools in the state (labeled 'NA').

This researcher had a sample size of only 40, but based on careful, stratified sampling (and with some adjustment), they should be able to generalize their intervention results to the populations of 22 US states.