Introduction

The OpenSDPsynthR package provides a set of functions for generating synthetic but realistic student-level data. Synthetic data is intended to make it easy for analysts across the country who are tackling similar problems to collaborate using a shared vocabulary. Users can collaborate directly on code and analyses, and can verify that an analysis works on synthetic data before applying it to live local data.

This vignette explains how the package is structured so that it can be modified to meet the needs of users.

library(OpenSDPsynthR)
default_sim <- sim_control()
names(default_sim)

There are over 25 user parameters that can be modified to control the simulation. The goal of these parameters is to allow the simulated student population to reflect a variety of possible educational environments ranging from small rural communities to large urban school districts.

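These parameters can have complex structures to allow for conditional and random generation of data. Parameters fall into four categories, each described in its own section below: vectors, conditional probability lists, outcome simulation controls, and outcome simulation adjustments.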

Vectors

The following vectors can be modified by the users:

Conditional Probability List

A conditional probability list is a list of lists in R. The GROUPVARS element names the grouping variable on which probabilities are conditioned. For example, if students are assigned gifted and talented status differently based on their sex, then GROUPVARS would be Sex. Each remaining element of the list is a separate list for one valid value of Sex, in this case Male and Female.

Male and Female are both lists that have two elements: f and pars. f defines a function that is used to generate the variable, and pars contains all of the parameters for that function.

str(default_sim$gifted_list)
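To make the structure concrete, the following is a minimal base R sketch of a conditional probability list and one way it could be applied. It does not use the package's internals: the object `gifted_sketch`, the helper `draw_conditional()`, and the probabilities are all hypothetical and chosen only for illustration.

```r
# Hypothetical list mirroring the GROUPVARS / f / pars layout described above.
# The probabilities are made up, not the package defaults.
gifted_sketch <- list(
  GROUPVARS = "Sex",
  Male   = list(f = rbinom, pars = list(size = 1, prob = 0.10)),
  Female = list(f = rbinom, pars = list(size = 1, prob = 0.12))
)

# Hypothetical helper: for each group, call that group's f with its pars,
# drawing one value per student in the group.
draw_conditional <- function(data, cond_list) {
  group <- as.character(data[[cond_list$GROUPVARS]])
  out <- integer(nrow(data))
  for (g in unique(group)) {
    idx <- which(group == g)
    spec <- cond_list[[g]]
    out[idx] <- do.call(spec$f, c(list(n = length(idx)), spec$pars))
  }
  out
}

set.seed(1)
students <- data.frame(
  Sex = sample(c("Male", "Female"), 200, replace = TRUE)
)
students$gifted <- draw_conditional(students, gifted_sketch)
table(students$Sex, students$gifted)
```

Storing the generating function and its parameters separately means a different distribution can be swapped in for one group without touching the others.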

Outcome Simulation Controls

Outcome simulation controls are lists of parameters that are passed to the sim_reg function in the simglm package, which simulates hierarchical data and outcomes.
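As an illustration of what simulating hierarchical data and outcomes means, here is a minimal base R sketch of a two-level random-intercept model with students nested in schools. It deliberately does not call simglm; every name and value in it (`n_schools`, `school_effect`, the coefficients) is an assumption made for the example.

```r
set.seed(7)

# Hypothetical design: 10 schools with 30 students each.
n_schools <- 10
n_per_school <- 30
n_students <- n_schools * n_per_school

# School membership for each student.
school <- rep(seq_len(n_schools), each = n_per_school)

# Level 2: one random intercept per school.
school_effect <- rnorm(n_schools, mean = 0, sd = 0.5)

# Level 1: a student covariate and residual error.
x <- rnorm(n_students)
error <- rnorm(n_students, mean = 0, sd = 1)

# Outcome = fixed intercept + fixed slope * x + school intercept + error.
y <- 1 + 0.4 * x + school_effect[school] + error

sim_data <- data.frame(school = school, x = x, y = y)
head(sim_data)
```

Students in the same school share a `school_effect`, which is what produces the within-school correlation that a flat, single-level simulation would lack.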

Each of these simulations requires the user to specify:

There are several of these parameters:

Outcome Simulation Adjustments

If we only rely on the simulation controls above, the data will be too predictable to be realistic, and structural inequalities along economic, racial, and gender lines will be underrepresented. To address this, it is possible to do post-simulation adjustments to introduce more variance to the outcomes.
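The idea of a post-simulation adjustment can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's adjustment mechanism: the list `adjust_sketch`, its `perturb` function, and the shift and noise values are hypothetical.

```r
set.seed(42)

# Hypothetical simulated outcomes with a program participation flag.
sim <- data.frame(
  frpl  = rbinom(500, size = 1, prob = 0.4),
  score = rnorm(500, mean = 0, sd = 1)
)

# Hypothetical adjustment: shift scores for one group and add extra noise
# to every score. The shift and sd values are illustrative only.
adjust_sketch <- list(
  perturb = function(x, frpl, shift = -0.3, sd = 0.2) {
    x + ifelse(frpl == 1, shift, 0) + rnorm(length(x), mean = 0, sd = sd)
  }
)

sim$score_adj <- adjust_sketch$perturb(sim$score, sim$frpl)

# Compare group means before and after the adjustment.
aggregate(cbind(score, score_adj) ~ frpl, data = sim, FUN = mean)
```

Keeping the adjustment as a function stored in a list lets it be applied after the main simulation step and replaced without rerunning that step.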

Baselines

Currently, two special parameters are set from baseline data built into the package: the initial grade distribution of students, and the initial participation of students in the ell, iep, and frpl programs.

These set some of the simulation requirements, but others are set using the baseline function family.

get_baseline("program")
get_baseline("grade")

Currently, baseline values cannot be modified by the user; this capability is planned for a future release.



OpenSDP/OpenSDPsynthR documentation built on June 20, 2020, 6:18 a.m.