knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tntpmetrics)
library(dplyr)
library(magrittr)

In most TNTP projects, we collect data on common metrics to understand the extent to which students, families, teachers, or others in an entire school, district, network, or state have access to a valuable resource. It's often not practical to collect all the possible data -- we can't collect every assignment students in a school receive over an entire year, we can't observe every lesson, we can't survey every teacher -- so we gather data on the common metrics for a subset of classrooms, teachers, etc. From whom or from which classrooms data is collected is consequential: if we do not select randomly, the common metric data might not be representative of the broader population about which we hope to make inferences.

Yet, if we choose participants randomly, we risk getting a random sample of data that does not contain enough variation to make equity or group comparisons. For example, if we randomly choose 5 schools from a district of 100, there is a chance we choose only primary schools, or get five schools that look demographically similar. It would be better if we could randomly pick schools while at the same time maximizing the chance that the schools vary on the characteristics that matter most to us and to the outcome(s) we're measuring.

This is exactly what sampling with implicit stratification does, and it's why serious educational research organizations like NCES use it often. The tntpmetrics package contains tools for you to easily draw your own samples with implicit stratification. This will help ensure the common metric data you collect is representative of your target population and will allow you to compare results across equity groups.

Practice data: cms_data

To demonstrate how to apply the implicit stratification sampling functions in tntpmetrics, we will use school-level data on all public schools in the Charlotte-Mecklenburg School district in the 2018-2019 school year. This data, cms_data, has already been cleaned and processed as part of an old Academic Diagnostic contract. This cleaning included filling in (or imputing) missing values for some newer schools with the district mean, so the data has no missingness.

Included in this data are a categorical variable indicating the grade levels served (Primary, Middle, High, or Other), the proportion of students receiving free or reduced-price lunch, the proportion of students of color, the state school accountability score (spg_score) and letter grade (spg_grade), and the number of enrolled students.

head(cms_data)

How does implicit stratification work? It's mostly just sorting data

To learn more about the specifics of implicit stratification, read the TNTP memo "placeholder_memo_title" (link to Adam's, potentially revised, memo). In short, we sort the data by the variable(s) on which we want to ensure variation, pick a random starting spot on this sorted list, and then count every k rows to get all of our selected units. The step size k is based on how many units are sampled: roughly, the number of rows in the data divided by the desired sample size. For example, to sample 10 schools from the CMS data, where the step size works out to 17, pick a random starting row between 1 and 17, and then count every 17th school until the end of the data. Each school picked along the way is part of the sample.
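As a minimal sketch of the step-size arithmetic (assuming k is simply the row count divided by the sample size, rounded down -- the package may handle remainders differently):

# For a sample of 10 schools, step through the sorted data every k rows
n <- 10
k <- floor(nrow(cms_data) / n)
k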

Importantly, because we first sorted the data on the characteristics we care about, we ensured that schools are the most spread out they could be on these variables -- i.e., the sort ensures that schools at the top of the data set look different on these variables than schools on the bottom. Thus, when we walk through the data, counting every 17th school, it's unlikely we get a sample of schools that all look the same on these characteristics. In fact, the only time this could happen is if schools in the district don't actually vary much on the characteristics we specify.

To be concrete, let's manually draw an implicitly stratified sample of 10 schools from the cms_data, where we want to ensure schools vary on the percent of students of color.

# Sort on soc_percent
sorted_data <- cms_data %>%
  arrange(soc_percent)

# Pick a random number between 1 and 17, then count forward in steps of 17, ten times
set.seed(1)
random_start <- sample(1:17, 1)
selected_rows <- seq(from = random_start, by = 17, length.out = 10)

# Keep the rows corresponding to the selected rows
implicit_sampled_data <- sorted_data %>%
  slice(selected_rows)

Let's see how much variability we have on soc_percent in our sampled data, compared to just randomly selecting 10 schools without implicit stratification.

# Randomly pick 10 schools without stratification
set.seed(1)
nonimplicit_sampled_data <- cms_data %>%
  sample_n(10)

# Use the standard deviation to compare variability on soc_percent
sd(implicit_sampled_data$soc_percent)
sd(nonimplicit_sampled_data$soc_percent)

The standard deviation of soc_percent in the implicitly stratified sample is higher than in the simple random sample, meaning it captured more of the variation on this characteristic.

Sorting on more than one variable: Serpentine sorting

Drawing an implicitly stratified sample is easy with just one characteristic of interest. We just did it manually above! It might seem just as easy with two characteristics of interest, as we could simply sort on both variables. But look what happens when we sort cms_data by grade-level and the proportion of students receiving FRL.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  arrange(grade_level_cat, frl_percent) %>%
  slice(24:29)

The data is first sorted (alphabetically) on grade-level, and then within each grade-level it is sorted in ascending order on the proportion of students receiving FRL. The code above shows where the data moves from high schools to middle schools. Notice how the proportion of FRL students makes a huge jump from Garinger High to Jay M Robinson Middle. Once the data reaches the next grade-level category, the sort must start over from the lowest values of frl_percent, which leads to large differences between consecutive rows at these junctures. Though these schools already differ by grade-level, this disconnect means they're very different on other key characteristics, too. Not only do they differ in their FRL proportion, but because this variable is linked to other characteristics, they're also very different on something like school achievement. These disconnected junctures in the sort are magnified if we sort by more than two variables.

Instead, we want consecutive rows of the data to be as similar as possible on the characteristics of interest. This is what helps us obtain the most variability possible in our sample. To do this, we use a sorting approach known as serpentine sorting: instead of always starting over with an ascending sort at each juncture, we alternate between ascending and descending sorts so that the values on either side of each juncture stay as similar as possible. tntpmetrics has a built-in function to perform serpentine sorts. Let's apply it to the CMS data and look at the same juncture as above.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  serpentine(grade_level_cat, frl_percent) %>%
  slice(24:29)

Now the first middle school in the list is the one with the highest proportion of students receiving FRL. The implicit stratification sampling function in tntpmetrics uses this serpentine sort, so you do not need to call serpentine directly, though it's available as part of the package if you need it for other purposes or want to modify the default sampling approach. It works much like dplyr::arrange: simply list the variables you want to serpentine sort.
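To make the logic concrete, here is a minimal sketch of a serpentine sort for one categorical and one numeric variable, built only from dplyr verbs (the package's serpentine handles any number and mix of variables; the grp and sort_key helpers below are purely illustrative). Flipping the sign of the numeric sort key in every other group alternates between ascending and descending order:

# Serpentine by hand: ascend in odd-numbered groups, descend in even ones
cms_data %>%
  mutate(
    grp = as.integer(factor(grade_level_cat)),
    sort_key = if_else(grp %% 2 == 1, frl_percent, -frl_percent)
  ) %>%
  arrange(grp, sort_key) %>%
  select(school_name, grade_level_cat, frl_percent) %>%
  slice(24:29)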

The order of sorting variables matters

Even with serpentine sorting, you will get a differently sorted list if you first sort on grade-level and then FRL percent compared to first sorting on FRL percent and then grade-level. The reason is obvious, but it's important to point out. In general, it's best to first sort on the variable that is most important to you, as this will be the variable on which your data is most spread out. Then list the remaining variables in order from most to least important. If you sort on many variables, the last variables you include will have a smaller effect on the final order, as much of the data order is already "locked in".
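You can see this directly by swapping the order of the variables passed to serpentine and comparing the first few rows of each result:

# Grade-level first: data is grouped by grade-level, then spread on FRL
cms_data %>%
  serpentine(grade_level_cat, frl_percent) %>%
  select(school_name, grade_level_cat, frl_percent) %>%
  slice(1:5)

# FRL first: FRL percent dominates the order, and grade-level has little effect
cms_data %>%
  serpentine(frl_percent, grade_level_cat) %>%
  select(school_name, grade_level_cat, frl_percent) %>%
  slice(1:5)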

Making sorts categorical

What if we wanted to ensure variability on both percent FRL and SPG Score in a school? The obvious answer is to just include them both in a sort.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  serpentine(frl_percent, spg_score) %>%
  slice(1:10)

Notice how the SPG score barely seems sorted at all. That is because the data is first sorted on FRL percent, and this variable takes a nearly unique value for each school: once the data is sorted on it, there are almost no ties left within which to sort anything else.
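You can confirm how little room is left by counting distinct values -- nearly every school has its own FRL percent:

# If n_distinct is close to the number of rows, there are almost no ties to break
n_distinct(cms_data$frl_percent)
nrow(cms_data)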

When you want to include multiple non-categorical variables in your implicit stratification, you have two options:

  1. Don't do it; just pick one non-categorical variable. This approach might be worthwhile if the two variables of interest are closely related. In these cases, adding the second variable doesn't add much to the stratification. But if the variables are not closely related and it's not critically important to keep one of the variables as is, your best option is to
  2. Make the variables categorical. This is the more common approach as it allows you to keep both variables in the stratification process.

In the CMS example, let's turn percent FRL and percent SOC into categorical variables.

cms_data %<>%
  mutate(
    frl_cat = case_when(
      frl_percent < 0.25 ~ "< 25%",
      frl_percent < 0.50 ~ "25-50%",
      frl_percent < 0.75 ~ "50-75%",
      frl_percent <= 1.00 ~ "> 75%"
    ),
    frl_cat = ordered(frl_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%")),
    soc_cat = case_when(
      soc_percent < 0.25 ~ "< 25%",
      soc_percent < 0.50 ~ "25-50%",
      soc_percent < 0.75 ~ "50-75%",
      soc_percent <= 1.00 ~ "> 75%"
    ),
    soc_cat = ordered(soc_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%"))
  )

We can then try the sort again.

cms_data %>%
  select(school_name, grade_level_cat, frl_cat, spg_score) %>%
  serpentine(frl_cat, spg_score) %>%
  slice(1:10)

Now things look better. It's fine to implicitly stratify by a non-categorical variable, but it usually should be the last variable listed, and in most cases you should include only one variable of this type.

Drawing an implicit sample

With an understanding of how implicitly stratified sampling works, actually drawing the sample is straightforward. Use the sample_implicit function on your data and indicate how many rows you want sampled. You do not need to sort the data ahead of time, as the function takes the variables on which you want to implicitly stratify and sorts for you. After running this function, you will get your same data back (serpentine-sorted) with a new variable called in_sample that is TRUE if the row was selected and FALSE if not.

For example, let's sample 20 CMS schools after implicitly stratifying on grade-level, proportion FRL, proportion SOC, and the SPG score.

cms_sample <- sample_implicit(
  data = cms_data,
  n = 20,
  grade_level_cat, frl_cat, soc_cat, spg_score
)
cms_sample %>%
  select(school_name, in_sample) %>%
  slice(1:10)

To get just the 20 sampled schools, you can filter on in_sample:

cms_sample %>%
  filter(in_sample)

This makes comparing sampled schools to non-sampled schools easy.

cms_sample %>%
  group_by(in_sample) %>%
  summarize(mean_frl = mean(frl_percent))

Accounting for size differences

The implicit stratification approach outlined above treats each row of data equally. In many cases that makes sense -- for example, if each row is a teacher or a classroom. In other cases, though, each row might represent a substantially different number of subunits, like classrooms or students. This is often true when sampling schools. Treating each school equally means students in small schools are more likely to be studied than students in large schools: a small school is just as likely to be chosen as a large one, but once a school is chosen, each classroom in a small school is more likely to be visited because there are fewer to choose from.

sample_implicit includes an option to specify a variable in your data representing a measure of size. It will then account for size, choosing rows with probability proportional to their size.
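For intuition, here is a sketch of how systematic probability-proportional-to-size (PPS) selection typically works -- stepping through the cumulative size at a fixed interval so that larger units are hit more often. This only illustrates the general idea; sample_implicit's internals may differ:

# Walk through cumulative enrollment in equal steps; a target lands in a
# school's interval with probability proportional to that school's size
# (assumes all sizes are positive; a school larger than `step` can be hit twice)
n <- 20
sorted <- cms_data %>% arrange(soc_percent)  # in practice, serpentine-sorted
sizes <- sorted$total_students_2018
step <- sum(sizes) / n
set.seed(1)
start <- runif(1, 0, step)
targets <- start + step * (0:(n - 1))
cum_hi <- cumsum(sizes)
cum_lo <- cum_hi - sizes
picked <- vapply(targets, function(t) which(t > cum_lo & t <= cum_hi), integer(1))
sorted %>% slice(unique(picked)) %>% select(school_name, total_students_2018)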

In the CMS data, we can use the total number of students in each school as the measure of size and account for it when we draw the sample.

cms_sample_withsize <- sample_implicit(
  data = cms_data,
  n = 20,
  grade_level_cat, frl_cat, soc_cat, spg_score,
  size_var = total_students_2018
)
cms_sample_withsize %>%
  filter(in_sample)

Notice how incorporating size increased the number of secondary schools in the sample. Because these schools typically contain more students, including size increased the chance they'd be chosen.
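One way to check this (exact counts will differ from draw to draw, since no seed was set above) is to compare the grade-level mix of the two samples side by side:

# Count sampled schools by grade-level, with and without the size adjustment
bind_rows(
  cms_sample %>% filter(in_sample) %>% mutate(draw = "equal probability"),
  cms_sample_withsize %>% filter(in_sample) %>% mutate(draw = "size-adjusted")
) %>%
  count(draw, grade_level_cat)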

Other sampling approaches

Implicitly stratified sampling is not the only sampling technique. But rather than creating functions to account for every sampling situation, this package relies on the fact that many other approaches can be accomplished with existing R functions and packages, or by combining them with sample_implicit.

Simple random sampling

If you don't need or want to select a stratified sample, it's easiest to use already available functions. For a simple random sample -- i.e., the equivalent of throwing each row of data into a hat and selecting n of them at random -- I recommend sample_n from dplyr. This is the approach used earlier in the vignette to choose 10 random schools.

set.seed(123)
cms_data %>%
  sample_n(10)

Explicitly stratifying

Explicitly stratifying a sample typically means selecting a fixed number of units from explicit groups (or strata). For example, if I wanted to ensure I selected 3 schools of each grade-level type, I'd first split the data by grade-level, then randomly select schools in each group. Again, it's easiest to implement this with already available tools: by combining sample_n and group_by from dplyr.

set.seed(123)
cms_data %>%
  group_by(grade_level_cat) %>%
  sample_n(3)

This approach guarantees a specific number of schools from each stratum. It differs from implicit stratification, where the sample will include units (e.g., schools) from the different strata but with no preordained or set number selected from each; in fact, if a stratum contains few units, there is a chance none of them will be chosen. Explicitly stratifying the data ensures each stratum is represented.

This could be good or bad: if you force a unit to be chosen from each stratum, the data may no longer be representative if the strata vary wildly in size. One would need to apply weights to the collected data to get back to representative values, a process seldom used at TNTP. At the same time, forcing a unit to be chosen from each stratum ensures no group is excluded, even if the data taken together is no longer representative of the overall population. Which approach to use depends on the needs of the project. If you're looking to get data that is representative of an entire school, district, network, etc., implicitly stratifying the data will likely get you there more easily.
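For reference, the weighting mentioned above is a standard design-weight calculation, not a tntpmetrics function: each sampled unit represents N_stratum / n_stratum units of its stratum. A hypothetical sketch for the 3-schools-per-stratum sample above:

# Each sampled school stands in for stratum_size / 3 schools in its stratum
cms_data %>%
  count(grade_level_cat, name = "stratum_size") %>%
  mutate(design_weight = stratum_size / 3)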

Combining explicit and implicit stratification using purrr::map

Using existing functions, we can also combine explicit and implicit stratification. For example, what if we wanted three schools from each grade-level type, but within each grade-level we did not want to pick schools randomly and instead wanted to implicitly stratify on the proportion of students receiving FRL and the SPG score? We can combine both stratification approaches with the help of the map function from the purrr package.

To implement this, we split the data by grade-level type, then apply the sample_implicit function separately to each grade-level. We'll use the map_df function to combine the results back into a single data.frame in one step.

library(purrr)
cms_data %>%
  split(.$grade_level_cat) %>%
  map_df(~ sample_implicit(.x, n = 3, frl_cat, spg_score)) %>%
  filter(in_sample) %>%
  select(school_name, grade_level_cat)

Instead of selecting the same number of schools for each grade-level, we can vary how many schools are chosen by providing the map function additional information. In this case, we'll use map2_df so that the value of n for each grade-level comes from the corresponding element of a list we supply. Note that split() orders the groups alphabetically and map2_df matches by position, so the list below must follow that same order. In the example below, we want to select 2 high schools, 3 middle schools, 4 primary schools, and 1 "other" school.

cms_data %>%
  split(.$grade_level_cat) %>%
  map2_df(
    .x = .,
    .y = list("High" = 2, "Middle" = 3, "Other" = 1, "Primary" = 4),
    .f = ~ sample_implicit(.x, n = .y, frl_cat, spg_score)
  ) %>%
  filter(in_sample) %>%
  select(school_name, grade_level_cat)

The above examples assume a familiarity with purrr, and indeed there are other ways to use sample_implicit with other existing R functions to meet your needs. The point is simply to highlight how you can build upon sample_implicit to accomplish your sampling goals.

Questions?

Contact Adam Maier with questions.


