sitePickR: Getting Started

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width=8, fig.height=5) 
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>"
)

Getting Started: Sampling AY 2017-18 California schools and districts

Elena Badillo-Goicoechea, Robert Olsen, and Elizabeth A. Stuart

2022-11-29

Introduction

sitepickR is designed to select a representative sample of sites for a prospective impact evaluation, such as a randomized controlled trial (RCT).

The main function in this package, selectMatch() lets the user carry out a two-level sample selection where the possibility of an initially selected unit not wanting to participate is anticipated. The procedure aims to reduce the bias (and/or loss of generalizability to the target population) this could introduce.

In selecting units and sub-units, sitepickR uses the cube method (e.g., Deville & Tillé, 2004; Tillé 2011). The cube method is a probability sampling method that is designed to satisfy criteria for balance between the sample and the population. Recent research has shown that this method performs well in simulations for studies of educational programs (Fay & Olsen, under review). To implement the cube method, sitepickerR uses the sampling R package. Users have the option to select units with equal probabilities or with probabilities proportional to their "size" measured in terms of the number of sub-units nested within units.

In addition, sitepickR uses statistical matching to select possible replacement units. In education RCTs, the share of selected districts that agrees to participate tends to be low. To address this challenge, sitepickR selects and ranks up to K replacement districts for each districts selected using the cube method. Replacement districts are selected using statistical matching based on propensity score methods. To implement statistical matching, sitepickR uses the MatchIt R package.

sitepickR's core sampling + matching procedure, implemented with the selectMatch() function, consists of four main steps:

  1. Study sample design, where we:

    • Identify a target population with a unit : sub-unit nested structure (e.g. [district A : schools in district A])
    • Set key parameter values (e.g. initial sample size, desired number of matches per unit)
    • Specify a set of observable covariates of interest at the unit level
    • Specify a set of sub-unit level observable covariates of interest
  2. Select a random sample of units from the target population.

  3. Obtain a list of best K matches for each initially selected unit, where match no. 1 is closest to a given original unit and match no. K is furthest, in terms of the covariates of interest and key parameter values.

These K matches will be the potential replacement units for each initially selected unit, in case their corresponding original unit is unable to participate in the study, and they are taken (with or without repetition, which you can set with the repFlag argument) from the pool of units that did not get selected in the initial random sampling procedure.

  1. Assess balance and match quality in terms of the covariates of interest:
    • Balance between initially selected ('original') units and the target population.
    • Balance between original units and each group of matches (1 to K, from closest to furthest).
    • Number of successful matches obtained by the procedure, given the restrictions imposed.
    • Balance between sub-units associated to each unit replacement group and the original sub-units in the population, in terms of available covariates of interest, both at the unit and sub-unit level.

This vignette will guide the reader, step-by-step, on sitepickR's basic functionalities, with data from the Common Core of Data (CCD) for California schools (2017-18), using a pre-processed dataset that is included with the package installation.

Technical details on each of the package's main functions and on the sample dataset is included in the documentation, which the reader is encouraged to consult, if necessary.

Common Core of Data

The CCD is the U.S. Department of Education's primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.

In addition to administrative data, it provides relevant information about schools and districts, disaggregated by demographics (grade, race/ethnicity, and sex) and socioeconomic factors (school-level counts of student eligible for free and reduced-price lunches).

Package and data set up

First, if needed, install the sitepickR package with the help of devtools:

if(!require(devtools)){
    install.packages("devtools", repos = "http://cran.us.r-project.org")
}

if(!require(sitepickR)){
    devtools::install_github("ElenaBadilloG/sitepickR")
    }

And load the package by calling its library:

library(sitepickR)

Now let's load the sample CCD-California 2017-18 dataset that comes with the package:

rawCCD <- sitepickR::rawCCD
knitr::kable(head(rawCCD), format = "html")

Process the sample aggregate dataset in order for it to be in the exact format expected by selectMatch by using sitepckr prepDF() function.

This simple step will re-define some key variables, and will create a new variable, 'unitSize', that will be used in the step of the selectMatch procedure where units are initially selected. Specifically, the cube sampling method implicit in selectMatch allows us to select units in a way that is not biased by its number of sub-units (which we would usually want to impose). This is explained in more detail in this section of the sampling documentation. What matters here, is that, having created a unitSize column, with selectMatch we can let the sampling step correct by size by setting the sizeFlag input to TRUE.

dfCCD <- prepDF(rawCCD,
                 unitID="LEAID", subunitID="NCESSCH")
knitr::kable(dfCCD[1:10,(ncol(dfCCD)-5-1):ncol(dfCCD)], format = "html")

Study sample design

Define key input values

Since there is some underlying random sampling in selectMatch, we can set a seed for replication purposes (optional):

seed <- 1122 

Set distance tolerance for restricted covariates:

calip <- 0.2 # maximum standard deviations of covariate distance around target population

Set output sample sizes

Set size for: 1) initial unit ("district", in our example) sample, Nu; 2) number of desired matches per unit, K; and 3) the number of sub-units ("schools", in this example) that will be randomly selected for each potential unit:

Nu <- 100 # district sample size
K <- 10 # number of matches per district
Ns <- 5 # number of schools per potential district

Specify covariates of interest

Now, specify unit level covariates on which you'll want to match selected units (again "districts" here) with their replacement candidates from the non-selected sample:

uSampVarsCCD <- c("w.pct.frlunch", "w.pct.black", "w.pct.hisp", "w.pct.female") 

Similarly, specify sub-unit ("school") level covariates on which you'll want to match districts.

In our CCD example, these covariates are just school level aggregates of conceptually the same district level underlying variable (e.g. percent of Hispanic students).

However, these could be any other sub-unit level variables available in your own dataset, not necessarily expressing the same variable as some other unit level covariate.

suSampVarsCCD <- c("sch.pct.frlunch", "sch.pct.black", "sch.pct.hisp", "sch.pct.female")

You can (optionally) specify covariates on which to exactly match units. These would usually be categorical covariates, with relatively few categories --otherwise the matching could fail, being too restrictive or even empty:

exactMatchVars <-  c("distr.type")

Similarly, you can optionally specify covariates on which to match units within a given numeric caliper, in terms of standard deviations. Such covariates are expected to be numeric variables.

In this case, we are interested in caliper-matching on three numeric covariates: percentage of Black, Hispanic, and female students in the district.

Notice that the lower the caliper, the more restrictive the matching (which could result in ferwe proportion of successful matches per replacement group, or even empty ones):

calipMatchVars <-  c("w.pct.black", "w.pct.hisp", "w.pct.female")

Get unit matches

We'll now leverage sampling + MatchIt packages joint functionality, and match districts with their candidate replacements among the non-initially selected units using Mahalanobis distance. We could also do this using propensity scores, as specified on MatchIt, and the output structure would be exactly the same:

smOut <- selectMatch(df = dfCCD, # user dataset
                       unitID = "LEAID", # column name of unit ID in user dataset
                       subunitID = "NCESSCH", # column name of sub-unit ID in user dataset
                       unitVars = uSampVarsCCD, # name of unit level covariate columns
                       subunitSampVars = suSampVarsCCD, # name of sub-unit level covariate columns
                       exactMatchVars = exactMatchVars, # unit level categorical covariates on which to match exactly
                       calipMatchVars = calipMatchVars, # unit level numeric covariates on which to match within a radius
                       nUnitSamp = Nu, # original unit sample size
                       nRepUnits = K, # number of desired matches per initially selected unit
                       nsubUnits = Ns, # number of sub-units to sample from each candidate unit
                       calipValue = calip, # maximum distance on which to match specified unit level covariates (calipMatchVars)
                       seedN = seed, # random seed number
                       matchDistance = "mahalanobis", # metric used for matching units
                       sizeFlag = TRUE,
                       repFlag = FALSE, # pick matches without repetition
                       writeOut = FALSE, # write out a csv file for: 1) matched units and 2) selected sub-units
                       replacementUnitsFilename = "replacementsTable.csv", # filename for {districtA: [districtA replacement list]} table
                       subUnitTableFilename = "schoolsDirectory.csv" # filename for {districtA: [districtA schools]} table
)

At this point, the two main outputs of the selectMatch procedure have been produced, and are stored inside the object we' have 've just defined as 'smOut'. In our example, these two main outputs are:

  1. A table with 10 matches (replacement candidates) for each of the 100 school districts we initially selected. The matches are ordered from the most to the least 'similar' to their original district (where 'similar' is close, in terms of covariate distance).

  2. A table with 5 (or less, when there are less than 5 schools available for a given district) randomly (if applicable) selected schools for each of the 100 original and each of the 100 x 10 candidate districts. Of course, each of these 5 schools are sampled exclusively from those that belong to each school district:

districtReplTable <- smOut[[1]]
knitr::kable(head(districtReplTable), format = "html")
schoolDirectory <- smOut[[2]]
knitr::kable(head(dplyr::filter(schoolDirectory, !is.na(Sub_unit5_ID))), format = "html")

Each of these two tables can be automatically stored as .csv files in your local computer, by simply setting writeOut = TRUE in the main function call to selectMatch. In case you use this feature, you can given those two files a specific name, using the [replacementUnitsFilename]repFlag and [subUnitTableFilename]repFlag selectMatch arguments (or just keeping their default values).

Assess balance and match quality

Now that we have our potential districts replacements, we'd like to know how 'similar' they really are to both the target population and the originally selected sample. Similarly, we'd like to know how well the schools that belong to each of these districts represent the schools of our target population, in terms of our covariates of interest. In the end, what we want is to be able to retain the original representativensess (imposed by our initial random sampling step) in our study design as much as possible, even when units "self-select".

sitepickR provides the user with several functions to carry out these balance diagnostics. By default --and following the standard literature-- balance diagnostics in sitepickR are expressed in terms of standardized mean difference (SMD) between two comparison groups of interest. In our case, these two comparison groups we'll be: 1) initially selected districts (or schools) against target population; 2) initially selected districts (or schools) against each of the 10 district replacement groups.

In addition, sitepickR has functions to find out how 'successful' the matching procedure was, in terms of how often was it possible to find exactly K, K-1, ..., 1, and 0 matches for the original units --possibly indicating we are asking for 'too much' in terms of maximum distance, or exact matching constraints, and thus, letting us re-adjust our parameters accordingly.

1. Original units vs. target population

First, we want to look at the overall balance between the initially selected districts (i.e. the group of districts that were selected in the initial cube sampling step of the selectMatch procedure) and all the districts in the population.

While no matching has taken place at this point (so, technically, selectMatch has not done its job yet), it might still be of interest to look at this initial balance, in order to have a benchmark for the quality of the subsequent matches:

unitLovePlot(smOut,
   title = "Standardized Mean Difference: \n Initially Selected Units vs. Population")

We can also take a look at a table with the actual SMD for each covariate of interest applying the getSummary() function, which takes the :

unitBalanceTab <- getSummary(smOut, diagnostic="unitBal")

knitr::kable(head(unitBalanceTab, 10), format = "html")

2. Original units vs. replacement candidates

Now, to assess the quality of our resulting district matches, we look at the balance (in terms of SMD) between the group of initially selected districts and each of its K replacement districts groups, where group 1 is composed of the first closest matches ordered in terms of distance, group 2 corresponds to the second-best matches, and so on:

matchBalance(smOut, 
  title = "Standardized Mean Difference: \
  Replacement Unit Groups (1...K) vs. Originally Selected Units")

Overall, we expect the SMD to increase from 1 to K. However, the decrease is not necessarily monotonic for each covariate (recall that the distance is computed over all of the covariates, simultaneously!). This diagnostic is quite interesting and relevant since it lets us know the quality of the matching for each replacement group and for each covariate separately. This way we can know more about the strengths and weaknesses of our final study sample.

In addition, it let's us diagnose how strict our matching restrictions are overall, given our data. A steeper curve would indicate rapidly match quality is rapidly deteriorating as we go from the K-1 to the K best replacement, while a flatter one would indicate that pretty much all of our candidate matches are similarly good.

matchBalanceTab <- getSummary(smOut, diagnostic="matchBal")
knitr::kable(head(matchBalanceTab, 20), format = "html")

3. Successful matches

Besides inspecting balance metrics, we might want to look at how many matches were actually computed for our original units. sitepickR lets us do this in both ways:

  1. matchFreq(): distribution of successful matches across all original units. So this is the frequency (from 1 to K) of computed matches for the initially selected sub-population. This gives us an idea of how successful our matching procedure was, overall.

  2. matchCount(): The number of successful matches between the original districts and replacement candidate districts from the non-selected pool, for each of their K replacements groups (which are by default sorted from most to least similar 1...K). So for the group of best matches (i.e. Replacement group 1), how many matches were computed by selectMatch, how many for the group of second-best matches, and so on. This gives us an idea of how restrictive our parameter settings were, taken together, and given our dataset structure. More specifically, the number of matches for each replacement group is expected to be: 1) decreasing in K, the number of matches (i.e. replacement candidates) we are asking for, and inversely related to restrictions imposed by exact matching, caliper value, and inherent structure and size of the data.

3.1 Match histogram

First, let's look at the overall match frequency the original units got, overall:

matchFreq(smOut,
             title="Match Frequency per Original Unit")

3.2 Successfully computed unit matches per unit group (1...K)

Now let's examine the percent of matches for each replacement group, sorted from closest to furthest units. Here, 100% for replacement group 1 means that all units got a best-match matches were successful, 50% for replacement group 2 means that only half of all original units got a 2nd best match, and so on:

matchCount(smOut,
             title="% of Successful Matches per Unit Group")

Again, we can, use getSummary() to access each of these stats directly in a table:

matchFreqTab <- getSummary(smOut, diagnostic="matchFreq")

knitr::kable(matchFreqTab, format = "html")
matchCountTab <- getSummary(smOut, diagnostic="matchCount")

knitr::kable(matchCountTab, format = "html")

4. Sub-units from original vs. candidate units

Ultimately, the procedure wants to yield not just districts, but schools that are similar to the target population, in terms of all specified covariates of interest (both the district and school level ones). sitepickR's subUnitBalance() function lets us assess balance precisely on this terms:

subUnitBalance(smOut,
     title="Standardized Mean Difference: \n Sub-units from Original + Replacement Unit Groups vs. Population")

Similary, use getSummary() function to obtain balance directly in a table:

subUnitBalanceTab <- getSummary(smOut, diagnostic="subunitBal")
knitr::kable(subUnitBalanceTab, format = "html")


Try the sitepickR package in your browser

Any scripts or data that you put into this service are public.

sitepickR documentation built on Dec. 5, 2022, 5:25 p.m.