In jknowles/eeptools: Convenience Functions for Education Data


Introduction

eeptools is an R package that makes it easier for analysts at state and local education agencies to analyze and visualize their data on student, school, and district performance. By putting simple wrappers around a number of R functions, eeptools aims to make many tasks common in education data analysis simpler and less error-prone.

Datasets

eeptools provides three datasets of interest to education researchers. These datasets are also used in the R Bootcamp for Education Analysts.

library(eeptools)
data("stuatt")
head(stuatt)

The stuatt (student attributes) dataset comes from the Strategic Data Project Toolkit for Effective Data Use. This dataset is useful for learning how to clean data in R and how to aggregate and summarize individual unit-record data into group-level data.
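
As a rough illustration of what aggregating unit-record data into group-level data can look like, the sketch below collapses a small student table to one row per school using base R. The data frame and its column names (students, school, frl, score) are hypothetical and are not taken from stuatt.

```r
# Hypothetical toy records, one row per student (not the stuatt data)
students <- data.frame(
  school = c("A", "A", "B", "B", "B"),
  frl    = c(1, 0, 1, 1, 0),            # 1 = eligible for free/reduced lunch
  score  = c(410, 450, 390, 420, 470)
)

# Collapse student rows to one row per school: mean of each measure
school_summary <- aggregate(cbind(frl, score) ~ school,
                            data = students, FUN = mean)

# Attach the number of student records behind each school mean
n_by_school <- table(students$school)
school_summary$n_students <-
  as.vector(n_by_school[as.character(school_summary$school)])

school_summary
```

Keeping the record count next to each mean makes it easier to spot group averages that rest on very few students.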

data(stulevel)
head(stulevel)

The stulevel dataset is a simulated student-level longitudinal record. It contains student- and school-level attributes and is useful for practicing longitudinal analyses of student unit-record data.

data("midsch")
head(midsch)

The midsch dataset contains an analysis on abnormality in school average assessment scores. It contains observed and predicted values of aggregated test scores at the school level for a large midwestern state.

Administrative Data Functions

For analysts using unit-record data of some type, several calc functions automate common tasks, including calculating ages (age_calc), grade retention (retained_calc), and student mobility (moves_calc).

age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "years")
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "months")
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "days")

age_calc also now properly accounts for leap years and leap seconds by default.
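
To see why leap years matter here, the following base R sketch cross-checks the same pair of dates without eeptools. It is not how age_calc is implemented; the variable names are illustrative only.

```r
dob     <- as.Date("1995-01-15")
enddate <- as.Date("2003-02-16")

# Exact elapsed days: Date subtraction already counts leap days
age_days <- as.numeric(enddate - dob)

# Completed years: subtract one if the birthday has not yet
# occurred in the end year
d <- as.POSIXlt(dob)
e <- as.POSIXlt(enddate)
birthday_pending <- (e$mon < d$mon) ||
  (e$mon == d$mon && e$mday < d$mday)
age_years <- (e$year - d$year) - as.integer(birthday_pending)

# Dividing by a fixed 365-day year drifts, because it ignores
# the leap days in 1996 and 2000
naive_years <- age_days / 365

c(days = age_days, completed_years = age_years)
naive_years
```

The gap between the naive figure and a calendar-aware one is small for a single student but can shift students across age cutoffs when applied to a whole file.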

retained_calc takes a vector of student identifiers and a vector of grades and checks whether or not the student was retained in the grade level specified by the user. It returns a data.frame of all students who could have been retained and a yes or no indicator of whether they were retained.

x <- data.frame(sid = c(101, 101, 102, 103, 103, 103, 104, 105, 105, 106, 106),
                 grade = c(9, 10, 9, 9, 9, 10, 10, 8, 9, 7, 7))
retained_calc(df = x, sid = "sid", grade = "grade", grade_val = 9)

retained_calc is intended to be used after you have processed your data, because it does not take time or sequence into account beyond the order in which the data are passed to it.
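
One plausible reading of this logic, sketched in base R: a student counts as retained in a grade if that grade appears more than once in their records. This is an assumption about the rule, not the package's implementation, and the names records, in_grade, and flags are hypothetical.

```r
records <- data.frame(
  sid   = c(101, 101, 102, 103, 103, 103, 104, 105, 105, 106, 106),
  grade = c(9, 10, 9, 9, 9, 10, 10, 8, 9, 7, 7)
)
grade_val <- 9

# Only students with at least one record in the grade can be flagged
in_grade <- records[records$grade == grade_val, ]

# Count how many times each of those students appears in the grade
times_in_grade <- table(in_grade$sid)

flags <- data.frame(
  sid      = as.numeric(names(times_in_grade)),
  retained = ifelse(as.vector(times_in_grade) > 1, "Y", "N")
)
flags
```

Because the count ignores dates, duplicate rows for the same year would look like a retention, which is why the data should be cleaned first.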

moves_calc uses enrollment and exit dates to identify whether a student changed schools within a school year.

df <- data.frame(sid = c(rep(1,3), rep(2,4), 3, rep(4,2)),
                   schid = c(1, 2, 2, 2, 3, 1, 1, 1, 3, 1),
                   enroll_date = as.Date(c('2004-08-26',
                   '2004-10-01', '2005-05-01', '2004-09-01',
                   '2004-11-03', '2005-01-11', '2005-04-02',
                   '2004-09-26', '2004-09-01','2005-02-02'), format='%Y-%m-%d'),
                   exit_date = as.Date(c('2004-08-26', '2005-04-10',
                    '2005-06-15', '2004-11-02', '2005-01-10',
                    '2005-03-01', '2005-06-15', '2005-05-30',
                    NA, '2005-06-15'), format='%Y-%m-%d'))

moves <- moves_calc(df, sid = "sid", schid = "schid", enroll_date = "enroll_date", 
                    exit_date = "exit_date")
moves
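
For intuition, a much simpler version of the idea can be sketched in base R by counting how often the school identifier changes between consecutive enrollment records. This is only an approximation under stated assumptions: it ignores exit dates and any rules about enrollment gaps or school-year boundaries that moves_calc may apply. The names enroll and count_changes are hypothetical.

```r
enroll <- data.frame(
  sid   = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4),
  schid = c(1, 2, 2, 2, 3, 1, 1, 1, 3, 1),
  enroll_date = as.Date(c("2004-08-26", "2004-10-01", "2005-05-01",
                          "2004-09-01", "2004-11-03", "2005-01-11",
                          "2005-04-02", "2004-09-26", "2004-09-01",
                          "2005-02-02"))
)

# Put each student's records in chronological order
enroll <- enroll[order(enroll$sid, enroll$enroll_date), ]

# Number of times consecutive records differ in school
count_changes <- function(s) sum(s[-1] != s[-length(s)])

school_changes <- tapply(enroll$schid, enroll$sid, count_changes)
school_changes
```

A student with a single record gets a count of zero, since there is no consecutive pair to compare.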

Manipulate Data

Another set of key functions in the package makes basic data manipulation easier. One thing users of other statistical packages may miss when using R is a convenient function for determining the mode of a vector. The statamode function is designed to do just that. statamode works with numeric, character, and factor data types. It also includes several options for handling ties, demonstrated below.

vecA <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
statamode(vecA, method = "stata")
vecB <- c(1, 1, 1, 3:10)
statamode(vecB, method = "last")
vecC <- c(1, 1, 1, NA, NA, 5:10)
statamode(vecC, method = "last")
vecA <- c(LETTERS[1:10]); vecA <- factor(vecA)
statamode(vecA, method = "last")
vecB <- c("A", "A", "A", LETTERS[3:10]); vecB <- factor(vecB)
statamode(vecB, method = "last")
vecA <- c(LETTERS[1:10])
statamode(vecA, method = "sample")
vecB <- c("A", "A", "A", LETTERS[3:10])
statamode(vecB, method = "stata")
vecC <- c("A", "A", "A", NA, NA, LETTERS[5:10])
statamode(vecC, method = "stata")
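
To make the tie problem concrete, here is a minimal base R sketch that returns every value sharing the highest frequency. It is not statamode's implementation, and all_modes is a hypothetical helper; a tie-handling option is, in effect, a rule for reducing this set to a single answer.

```r
# Return all values that occur with the maximum frequency.
# table() drops NA by default and returns its names as character.
all_modes <- function(v) {
  tab <- table(v)
  names(tab)[tab == max(tab)]
}

all_modes(c(1, 1, 1, 3:10))                  # a single mode
all_modes(c(1, 1, 1, NA, NA, 5:10))          # NAs are ignored
all_modes(c("A", "B", "B", "A", "C"))        # two values tie
length(all_modes(1:10))                      # every value ties
```

Note that the result is always a character vector, so numeric input needs converting back with as.numeric() if numbers are required.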

A number of functions save you keystrokes: defac converts a factor to a character, makenum turns a factor variable into a numeric variable, and max_mis takes the maximum of a numeric vector while ignoring any NAs (useful inside do.call or apply constructions). remove_char lets you quickly gsub out a specific character, such as * or ..., from a string vector. decomma is a somewhat specialized version of this for processing data where numerics are written with commas. nth_max lets you identify the 2nd, 3rd, etc. maximum value in a vector.
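
For readers wondering what these helpers replace, the base R idioms below perform comparable operations. They are approximate equivalents written for illustration, not the package's source, and the variable names are hypothetical.

```r
f <- factor(c("10", "20", "10"))

as_char  <- as.character(f)               # factor -> character
as_num   <- as.numeric(as.character(f))   # factor -> the numeric values
as_codes <- as.numeric(f)                 # pitfall: internal level codes

scores <- c(3, NA, 8, 4, 9, 1)
max_ignoring_na <- max(scores, na.rm = TRUE)
# sort() drops NA by default, so position 2 is the second-largest value
second_largest <- sort(scores, decreasing = TRUE)[2]

# fixed = TRUE treats "*" as a literal character, not a regex operator
no_star  <- gsub("*", "", c("12*", "7"), fixed = TRUE)
no_comma <- as.numeric(gsub(",", "", c("1,200", "15,000"), fixed = TRUE))
```

The as.numeric(f) line shows the mistake these wrappers help avoid: converting a factor directly yields its level codes rather than the values it displays.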

Regression Models

eeptools simplifies the use of the regression analysis tools recommended by Gelman and Hill (2006) through the gelmansim function, which is a wrapper for arm::sim().

require(MASS)
# Examples of "sim"
set.seed(1)
# Simulate unbalanced grouped data: group j contributes j observations
J <- 15
n <- J * (J + 1) / 2
group <- rep(1:J, 1:J)
mu.a <- 5
sigma.a <- 2
a <- rnorm(J, mu.a, sigma.a)   # group-level intercepts
b <- -3                        # common slope
x <- rnorm(n, 2, 1)
sigma.y <- 6
y <- rnorm(n, a[group] + b * x, sigma.y)
u <- runif(J, 0, 3)
dat <- cbind(y, x, group)
# Linear regression
dat <- as.data.frame(dat)
dat$group <- factor(dat$group)
M3 <- glm(y ~ x + group, data = dat)
# Cases to simulate predictions for
cases <- expand.grid(x = seq(-2, 2, by = 0.1),
                     group = seq(1, 14, by = 2))
cases$group <- factor(cases$group)
sim.results <- gelmansim(mod = M3, newdata = cases, n.sims = 200, na.omit = TRUE)
head(sim.results)
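
The general idea behind simulation-based prediction can be sketched without arm or eeptools. The example below draws coefficient vectors from a multivariate normal centered on the fitted estimates and summarizes the implied predictions. It is a simplification under stated assumptions: it uses the built-in mtcars data rather than the data above, holds the residual standard deviation fixed, and is not the algorithm gelmansim or arm::sim() runs.

```r
library(MASS)   # for mvrnorm()

set.seed(1)
fit <- lm(mpg ~ wt, data = mtcars)

# Draw 200 plausible coefficient vectors around the estimates
draws <- mvrnorm(200, mu = coef(fit), Sigma = vcov(fit))

# Cases to predict for, expanded to a design matrix
newdata <- data.frame(wt = c(2.5, 3.5))
X <- model.matrix(~ wt, data = newdata)

# One column of predictions per simulated coefficient vector
preds <- X %*% t(draws)

sim_summary <- data.frame(
  wt    = newdata$wt,
  yhat  = rowMeans(preds),
  lower = apply(preds, 1, quantile, probs = 0.025),
  upper = apply(preds, 1, quantile, probs = 0.975)
)
sim_summary
```

The interval here reflects uncertainty in the coefficients only, so it is narrower than an interval for an individual new observation would be.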

There is also a ggplot2 version of plot.lm included:

library(ggplot2)  # provides the mpg dataset and the autoplot generic
data(mpg)
mymod <- lm(cty ~ displ + cyl + drv, data = mpg)
autoplot(mymod)

Finally, there is a convenient method for creating labeled mosaic plots.

sampDat <- data.frame(cbind(x=seq(1,3,by=1), y=sample(LETTERS[6:8], 60, 
                                                        replace=TRUE)),
                        fac=sample(LETTERS[1:4], 60, replace=TRUE))
varnames <- c('Quality', 'Grade')
crosstabplot(sampDat, "y", "fac", varnames = varnames,  label = TRUE, 
             title = "Crosstab Plot", shade = FALSE)
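
A mosaic plot is a picture of a two-way table, with each tile's area proportional to a cell count. The base R sketch below builds that underlying table for comparable toy data; toy, counts, and shares are hypothetical names, and the seed is set only to make the sample reproducible.

```r
set.seed(42)
toy <- data.frame(
  y   = sample(LETTERS[6:8], 60, replace = TRUE),
  fac = sample(LETTERS[1:4], 60, replace = TRUE)
)

# Cell counts for each combination of the two variables
counts <- table(toy$y, toy$fac, dnn = c("Quality", "Grade"))

# Share of all observations in each cell (tile area in a mosaic plot)
shares <- prop.table(counts)

counts
round(shares, 2)
```

Inspecting the counts first is a quick way to confirm that the labels on a mosaic plot match the data behind it.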

Helping Out

Review the Contributor Guide for specific directions and tips on how to get involved.

eeptools is intended to be a useful project for the education analytics community. Contributions are welcomed. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

jknowles/eeptools documentation built on Aug. 30, 2023, 10:05 p.m.
