knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "tools/figs/README-",
  message = FALSE,
  warning = FALSE
)

OpenSDPsynthR

A project to generate realistic synthetic unit-level longitudinal education data to empower collaboration in education analytics.

Design Goals

  1. Generate synthetic education data that is realistic enough for analysts across the education sector to use. Here, realistic means messy and reflective of the general pattern of relationships found in the U.S. education sector.
  2. Synthetic data should be generated on demand and respond to user inputs. These inputs should let users configure the process so that the output resembles the data patterns in their own agency.
  3. The package should be modular and extensible, so that new data topics can be added as needed and synthetic data coverage can grow.

Structure

The package is organized into the following functions:

Get Started

To use OpenSDPsynthR, follow the instructions below:

Install Package

The development version of the package can be installed with install_github(). This function comes from the devtools package, so install devtools first if you do not already have it.

devtools::install_github("opensdp/OpenSDPsynthR")

Make some data

Load the package

library(OpenSDPsynthR)

The main function of the package is simpop, which generates a list of data elements corresponding to simulated educational careers, K-20, for a user-specified number of students. In R, a list is a data structure that can hold multiple elements of different structures, which makes it suitable for emulating the multiple tables of a Student Information System (SIS).

out <- simpop(nstu = 500, seed = 213, control = sim_control(nschls = 3))
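To make the list-of-tables idea concrete without running a simulation, here is a minimal base R sketch. The table and column names (students, enrollment, birth_year, grade) are hypothetical stand-ins chosen for illustration; they are not the package's actual schema.

```r
# Hypothetical stand-in for a multi-table student information system.
# Table and column names are illustrative, not the package's schema.
sis <- list(
  students = data.frame(
    sid = c("001", "002", "003"),
    birth_year = c(2005L, 2006L, 2005L),
    stringsAsFactors = FALSE
  ),
  enrollment = data.frame(
    sid = c("001", "001", "002", "003"),
    year = c(2015L, 2016L, 2016L, 2016L),
    grade = c("4", "5", "4", "5"),
    stringsAsFactors = FALSE
  )
)

# Each list element is a separate table with its own shape
names(sis)
vapply(sis, nrow, integer(1))

# Tables are linked through a shared key, as in a relational database
enroll_full <- merge(sis$enrollment, sis$students, by = "sid")
enroll_full
```

The same access pattern (name a table with $, then join on a shared identifier) should carry over to the list returned by simpop, assuming its tables share a student identifier column.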

Currently ten tables are produced:

names(out)

Data elements produced include:

There are two tables of metadata about the assessment data above to be used in cases where multiple types of student assessment are analyzed together.

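library(dplyr)  # provides bind_rows(), arrange(), select(), and %>%

# Build an index with one row per table/column pair in the simulated output
table_names <- data.frame(table = NULL, column = NULL)
for (i in seq_along(out)) {
  table_name <- names(out)[[i]]
  columns <- names(out[[i]])
  tmp <- data.frame(table = table_name, column = columns,
                    stringsAsFactors = FALSE)
  table_names <- bind_rows(table_names, tmp)
}
# Preview the first four columns of demog_master, sorted by sid
head(out$demog_master %>% arrange(sid) %>% select(1:4))
# Preview the first ten rows of stu_year
head(out$stu_year, 10)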
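The loop above builds one row per table/column pair. The following base R sketch applies the same idea to a small hypothetical list, so it runs without the package or dplyr. The helper index_columns() and the toy tables are assumptions introduced here for illustration only.

```r
# Toy list of tables; names and values are hypothetical
tables <- list(
  demog = data.frame(sid = 1:2, sex = c("F", "M")),
  scores = data.frame(sid = 1:2, year = 2016L, score = c(410, 395))
)

# Hypothetical helper: return one row per table/column pair
index_columns <- function(x) {
  rows <- lapply(names(x), function(nm) {
    data.frame(table = nm, column = names(x[[nm]]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

column_index <- index_columns(tables)
column_index

# Columns that appear in more than one table are candidate join keys
shared <- names(which(table(column_index$column) > 1))
shared
```

An index like this is a quick way to find which columns could link tables before writing any joins.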

Cleaners

You can reformat the synthetic data for use in specific types of projects. Two cleaner functions currently exist: sdp_cleaner formats the simulated data into an analysis file matching the SDP College-going data specification, and ceds_cleaner formats it into a CEDS-like data specification. More of these functions are planned.

cgdata <- sdp_cleaner(out)
ceds <- ceds_cleaner(out)

Control Parameters

By default, the only argument you need to pass to simpop is the number of students to simulate. The package's default simulation parameters produce a small school district with two schools.

names(sim_control())
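To see how an override differs from the defaults, a minimal sketch follows. It assumes that OpenSDPsynthR is installed and that sim_control() returns a list with an element named nschls; the names(sim_control()) call above would confirm the second assumption. The requireNamespace() guard lets the script run without error when the package is missing.

```r
# Assumes OpenSDPsynthR is installed; the guard skips the example otherwise
has_pkg <- requireNamespace("OpenSDPsynthR", quietly = TRUE)

if (has_pkg) {
  defaults <- OpenSDPsynthR::sim_control()
  custom <- OpenSDPsynthR::sim_control(nschls = 3)

  # Compare the overridden parameter with its default value.
  # Assumption: the control list stores it under the name "nschls".
  str(defaults$nschls)
  str(custom$nschls)
}
```

A control list built this way is what the control argument of simpop receives, as in the earlier simpop call.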

These parameters can have complex structures to allow for conditional and random generation of data. Parameters fall into four categories:

For more details, see the simulation control vignette.

vignette("Controlling the Data Simulation", package = "OpenSDPsynthR")

Package Dependencies

OpenSDP

OpenSDPsynthR is part of the OpenSDP project.

OpenSDP is an online, public repository of analytic code, tools, and training intended to foster collaboration among education analysts and researchers in order to accelerate the improvement of our school systems. The community is hosted by the Strategic Data Project, an initiative of the Center for Education Policy Research at Harvard University. We welcome contributions and feedback.

These materials were originally authored by the Strategic Data Project.



OpenSDP/OpenSDPsynthR documentation built on June 20, 2020, 6:18 a.m.