The OpenSDPsynthR package provides a set of functions that let users
generate synthetic, but authentic, student-level data. Synthetic data is
intended to make it easy to collaborate with analysts across the country
who are tackling similar problems and using a shared vocabulary. The synthetic
data will allow users to collaborate directly on code and analyses, and to
verify that their analysis works on synthetic data before translating it to
live local data.
This vignette explains how the package is structured so that it can be modified to meet the needs of users.
library(OpenSDPsynthR)
default_sim <- sim_control()
names(default_sim)
There are over 25 user parameters that can be modified to control the simulation. The goal of these parameters is to allow the simulated student population to reflect a variety of possible educational environments ranging from small rural communities to large urban school districts.
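As a minimal sketch of how one of these parameters might be overridden, the example below passes nschls to sim_control() and compares the result with the defaults. It assumes that sim_control() accepts the parameters described in this vignette as named arguments and returns a named list. The requireNamespace() guard is only there so the snippet is skipped when the package is not installed.

```r
# Sketch only: assumes sim_control() takes `nschls` as a named argument
# and returns a list with an element of the same name.
if (requireNamespace("OpenSDPsynthR", quietly = TRUE)) {
  default_sim <- OpenSDPsynthR::sim_control()
  custom_sim <- OpenSDPsynthR::sim_control(nschls = 3L)

  # Compare the default and the overridden number of schools
  print(default_sim$nschls)
  print(custom_sim$nschls)
}
```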
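These parameters can have complex structures to allow for conditional and random generation of data. Parameters fall into four categories: vectors, conditional probability lists, outcome simulation controls passed to the simglm package, and post-simulation outcome adjustments.
The following vectors can be modified by the users: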
nschls: integer, number of high schools to assign students to
best_schl: character, length 1, school ID for the highest performing school, e.g. "01"
race_groups: character, length ?, names of racial subgroups to create in the simulation, defaults to US Census Groups
race_prob: numeric, length = length(race_groups), proportion of population in each racial group
minyear: integer, length 1, the first year of student data available
maxyear: integer, length 1, the last year of student data available
n_cohorts: integer, length 1, the number of graduation cohorts to create
school_names: character, length = nschls, names of schools
assess_grades: character, grade levels to simulate assessment scores for
postsec_names: character, length = n_postsec, names of postsecondary schools
postsec_method: character, length = 1, name of method to draw postsecondary schools from

A conditional probability list is a list of lists in R. The GROUPVARS element
specifies the grouping variable used to conditionally assign probabilities. For
example, if students are assigned gifted and talented status differently based
on their sex, then this would specify Sex. The other elements of the list
will be a separate list for each valid value of Sex, in this case Male and
Female.
Male and Female are both lists that have two elements: f and pars. f defines a
function that is used to generate the variable, and pars contains all of the
parameters for that function.
str(default_sim$gifted_list)
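To make the structure concrete, here is a small base R sketch of a conditional probability list and one way it could be applied. The probabilities, the choice of rbinom as the generating function, and the draw_conditional() helper are all assumptions made for illustration. They are not taken from the package, whose internal handling may differ.

```r
# Hypothetical conditional probability list; the probabilities are made up.
gifted_sketch <- list(
  GROUPVARS = "Sex",
  Male = list(f = rbinom, pars = list(size = 1, prob = 0.10)),
  Female = list(f = rbinom, pars = list(size = 1, prob = 0.20))
)

# Illustrative helper (not part of the package): for each group, call that
# group's function `f` with its `pars` to draw one value per row.
draw_conditional <- function(data, cond_list) {
  group <- as.character(data[[cond_list$GROUPVARS]])
  out <- numeric(nrow(data))
  for (g in unique(group)) {
    idx <- which(group == g)
    spec <- cond_list[[g]]
    out[idx] <- do.call(spec$f, c(list(n = length(idx)), spec$pars))
  }
  out
}

set.seed(42)
students <- data.frame(Sex = sample(c("Male", "Female"), 200, replace = TRUE))
students$gifted <- draw_conditional(students, gifted_sketch)

# Share of students flagged as gifted within each group
tapply(students$gifted, students$Sex, mean)
```

Keeping the function and its parameters together in each group's entry means a different distribution can be swapped in for one group without touching the others.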
gifted_list: a list defining how students are assigned to gifted and talented programs
iep_list: a list defining how students are assigned to special education programs
ses_list: a list defining how students are assigned to free and reduced price lunch status
ell_list: a list defining how students are assigned to English Language Learner status
ps_transfer_list: a list defining the likelihood a student transfers between postsecondary institutions

Outcome simulation controls are lists with parameters to pass to the sim_reg
function in the simglm package, which simulates hierarchical data and
outcomes.
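As a rough sketch, such a control list might look like the following in base R, using the element names described in this section. All values and the variable names in the formula are hypothetical. cov_param and cor_vars are left out because their exact structure depends on the simglm version in use. The final lines check the stated rule that fixed_param has one entry per variable in fixed, plus one for the intercept.

```r
# Hypothetical values; only the element names come from this vignette.
gpa_sketch <- list(
  fixed = ~ 1 + math_ss + gifted,  # level 1 formula (variable names made up)
  random_var = 0.02,               # variance at the second level
  fixed_param = c(2.5, 0.3, 0.2),  # intercept plus one beta per variable
  ngrps = 10,                      # number of second-level groups
  unbalanceRange = c(50, 200),     # min and max observations per group
  type = "linear"
)

# Consistency check: one coefficient per variable in `fixed`, plus the intercept
n_vars <- length(attr(terms(gpa_sketch$fixed), "term.labels"))
stopifnot(length(gpa_sketch$fixed_param) == n_vars + 1)
stopifnot(length(gpa_sketch$unbalanceRange) == 2)
```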
Each of these simulations requires the user to specify:
fixed: a right-hand-side (RHS) formula of the format ~ 1 + var1 + var2 defining the level 1 variables for the simulation
random_var: numeric, length 1, specifying the variance at the second level
cov_param: a list, with length equal to the number of variables in fixed + 1 for the intercept, defining the function and parameters used to generate the X values
cor_vars: a matrix of the variance between the X variables in fixed
fixed_param: a numeric vector, with length equal to the number of variables in fixed + 1, representing the beta coefficients
ngrps: numeric, length 1, number of second-level grouping terms
unbalanceRange: numeric, length 2, representing the minimum and maximum number of observations in each second-level cluster
type: character, either "linear" or NULL

There are several of these parameters:
gpa_sim_parameters: simulation parameters for the GPA simulation
grad_sim_parameters: simulation parameters for high school graduation
ps_sim_parameters: simulation parameters for postsecondary enrollment
assess_sim_par: simulation parameters for student assessment data

If we only rely on the simulation controls above, the data will be too predictable to be realistic, and structural inequalities along economic, racial, and gender lines will be underrepresented. To address this, it is possible to do post-simulation adjustments to introduce more variance to the outcomes.
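The idea of a post-simulation adjustment can be sketched in base R as a function that shifts an already simulated outcome by a group-specific amount and adds group-specific noise. The perturb_by_group() function, the column names, and the shift and spread values below are hypothetical and chosen only for demonstration. The package's own perturbation functions may take different arguments.

```r
# Illustrative adjustment (not the package's own function): shift an outcome
# by a group-specific amount and add group-specific random noise.
perturb_by_group <- function(x, group, shifts, sds) {
  g <- as.character(group)
  x + unname(shifts[g]) + rnorm(length(x), mean = 0, sd = unname(sds[g]))
}

set.seed(1)
sim_data <- data.frame(
  frl = sample(c("0", "1"), 500, replace = TRUE),
  score = rnorm(500, mean = 0, sd = 1)
)

# Hypothetical shifts and spreads, indexed by group label
shifts <- c("0" = 0, "1" = -0.4)
sds <- c("0" = 0.1, "1" = 0.3)
sim_data$score_adj <- perturb_by_group(sim_data$score, sim_data$frl, shifts, sds)

# Compare group means before and after the adjustment
tapply(sim_data$score, sim_data$frl, mean)
tapply(sim_data$score_adj, sim_data$frl, mean)
```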
race_list:
perturb_race: function,
frl_list:
perturb_frl: function,
assessment_adjustment: adjustments to the assessment score
grad_adjustment: adjustments to the graduation probability
ps_adjustment: adjustments to the postsecondary probability
gpa_adjustment: adjustments to the grade point average

Currently there are two special parameters that are set based on baseline data
built into the package. These are the initial grade distribution of students,
and the initial program participation of students in ell, iep, and frpl
programs.
These set some of the simulation requirements, but others are set using the
baseline function family.
get_baseline("program")
get_baseline("grade")
Currently, baseline values cannot be modified by the user, but this will come in a future release.