eeptools is an R package that seeks to make it easier for analysts at
state and local education agencies to analyze and visualize their data on student,
school, and district performance. By putting simple wrappers around a number of
R functions, eeptools strives to make many common tasks in the analysis of
education data simpler and less prone to error.
knitr::opts_chunk$set(cache = FALSE, comment = "#>", collapse = TRUE, echo = TRUE)
library(knitr)
library(eeptools)
For analysts using unit-record data of some type, there are several calc functions
which automate common tasks including calculating ages (age_calc), grade retention
(retained_calc), and student mobility (moves_calc).
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), units = "years")
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), units = "months")
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), units = "days")
age_calc also now properly accounts for leap years and leap seconds by default.
age_calc can be passed a vector of dates of birth together with either a vector of
end dates or a single end date, and it returns a vector of ages. This makes it
suitable for computing student age on the fly from date-of-birth records.
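As a rough illustration of a vectorized age calculation, the base R sketch below computes completed years of age for several dates of birth against one end date. It is not how age_calc is implemented and it returns whole years only; the dates of birth are made up for the example.

```r
# Hypothetical dates of birth; the single end date is recycled across them.
dob <- as.Date(c("1995-01-15", "1996-02-29", "1994-12-31"))
enddate <- as.Date("2003-02-16")

b <- as.POSIXlt(dob)
e <- as.POSIXlt(enddate)

# Subtract one year when the birthday has not yet occurred in the end year.
not_yet <- (e$mon < b$mon) | (e$mon == b$mon & e$mday < b$mday)
age_years <- (e$year - b$year) - not_yet
age_years
```

For fractional years, months, or days, age_calc with the appropriate units argument is the more convenient route.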
retained_calc takes a vector of student identifiers and a vector of grades and
checks whether or not the student was retained in the grade level specified by
the user. It returns a data.frame of all students who could have been retained
and a yes or no indicator of whether they were retained.
x <- data.frame(sid = c(101, 101, 102, 103, 103, 103, 104, 105, 105, 106, 106),
                grade = c(9, 10, 9, 9, 9, 10, 10, 8, 9, 7, 7))
retained_calc(df = x, sid = "sid", grade = "grade", grade_val = 9)
retained_calc is intended to be used after you have processed your data, because
it does not take into account time or sequence other than the order in which the
data are passed to it.
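The check described above can be approximated in base R by counting how many records each student has at the chosen grade. This is a sketch of the idea only, not the actual implementation of retained_calc, and the "Y"/"N" coding of the indicator is an assumption.

```r
x <- data.frame(sid = c(101, 101, 102, 103, 103, 103, 104, 105, 105, 106, 106),
                grade = c(9, 10, 9, 9, 9, 10, 10, 8, 9, 7, 7))

# Keep only records at the grade of interest, then count records per student.
at_grade <- x[x$grade == 9, ]
n_records <- table(at_grade$sid)

# A student with more than one record at that grade is flagged as retained.
flags <- data.frame(sid = names(n_records),
                    retained = ifelse(as.vector(n_records) > 1, "Y", "N"))
flags
```

Because this sketch only counts records, an accidental duplicate row for the same student and grade would be flagged as a retention, which is one reason to clean the data first.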
moves_calc is intended to identify, based on enrollment dates, whether a student
changed schools within a school year.
df <- data.frame(sid = c(rep(1,3), rep(2,4), 3, rep(4,2)),
                 schid = c(1, 2, 2, 2, 3, 1, 1, 1, 3, 1),
                 enroll_date = as.Date(c('2004-08-26', '2004-10-01', '2005-05-01',
                                         '2004-09-01', '2004-11-03', '2005-01-11',
                                         '2005-04-02', '2004-09-26', '2004-09-01',
                                         '2005-02-02'), format='%Y-%m-%d'),
                 exit_date = as.Date(c('2004-08-26', '2005-04-10', '2005-06-15',
                                       '2004-11-02', '2005-01-10', '2005-03-01',
                                       '2005-06-15', '2005-05-30', NA,
                                       '2005-06-15'), format='%Y-%m-%d'))
moves <- moves_calc(df, sid = "sid", schid = "schid",
                    enroll_date = "enroll_date", exit_date = "exit_date")
moves
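For intuition, a much simpler notion of mobility can be sketched in base R: order each student's records by enrollment date and count how often the school identifier changes. This is deliberately cruder than moves_calc, which works from both enrollment and exit dates and may count moves differently; the variable names below are illustrative.

```r
enroll <- data.frame(
  sid = c(rep(1, 3), rep(2, 4), 3, rep(4, 2)),
  schid = c(1, 2, 2, 2, 3, 1, 1, 1, 3, 1),
  enroll_date = as.Date(c("2004-08-26", "2004-10-01", "2005-05-01",
                          "2004-09-01", "2004-11-03", "2005-01-11",
                          "2005-04-02", "2004-09-26", "2004-09-01",
                          "2005-02-02"))
)

# Sort each student's records chronologically.
enroll <- enroll[order(enroll$sid, enroll$enroll_date), ]

# Count consecutive records where the school identifier differs.
school_changes <- tapply(enroll$schid, enroll$sid,
                         function(s) sum(diff(s) != 0))
school_changes
```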
Another set of key functions in the package makes basic data manipulation
easier. One thing users of other statistical packages may miss when using R is
a convenient function for determining the mode of a vector. The statamode
function is designed to do just that. statamode works with numeric, character,
and factor data types. It also includes several options for handling ties,
demonstrated below.
vecA <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
statamode(vecA, method = "stata")
vecB <- c(1, 1, 1, 3:10)
statamode(vecB, method = "last")
vecC <- c(1, 1, 1, NA, NA, 5:10)
statamode(vecC, method = "last")
vecA <- c(LETTERS[1:10]); vecA <- factor(vecA)
statamode(vecA, method = "last")
vecB <- c("A", "A", "A", LETTERS[3:10]); vecB <- factor(vecB)
statamode(vecB, method = "last")
vecA <- c(LETTERS[1:10])
statamode(vecA, method = "sample")
vecB <- c("A", "A", "A", LETTERS[3:10])
statamode(vecB, method = "stata")
vecC <- c("A", "A", "A", NA, NA, LETTERS[5:10])
statamode(vecC, method = "stata")
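To make the tie problem concrete, here is a small base R sketch of a mode function with two tie rules. The function simple_mode and its ties options are hypothetical names invented for this illustration; they are not statamode or its method values, and statamode may resolve ties differently.

```r
# Hypothetical helper: returns the most frequent value as a character string.
simple_mode <- function(x, ties = c("none", "last")) {
  ties <- match.arg(ties)
  counts <- table(x)                       # table() drops NA by default
  modes <- names(counts)[counts == max(counts)]
  if (length(modes) == 1) return(modes)    # a unique mode, no tie to resolve
  if (ties == "none") return(NA_character_)
  # ties = "last": the tied value that appears latest in the data
  x_chr <- as.character(x)
  x_chr[max(which(x_chr %in% modes))]
}

simple_mode(c("A", "A", "A", LETTERS[3:10]))      # unique mode
simple_mode(LETTERS[1:10], ties = "none")         # every value is tied
simple_mode(LETTERS[1:10], ties = "last")         # tie broken by position
```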
There are a number of functions to save you keystrokes, like defac for converting
a factor to a character, makenum for turning a factor variable into a numeric
variable, and max_mis for taking the maximum of a vector of numerics while ignoring
any NAs (useful for inclusion in do.call or apply constructions). remove_char
allows you to quickly gsub out a specific character from a string vector, such
as an asterisk (*) or an ellipsis (...). decomma is a somewhat specialized version
of this for processing data where numerics are written with commas. nth_max allows
you to identify the 2nd, 3rd, etc. maximum value in a vector.
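For readers curious what these helpers save, the lines below show rough base R equivalents of the tasks described above. They are assumptions about the general behavior, written for illustration; the package functions may handle edge cases differently.

```r
f <- factor(c("10", "20", "5"))

as_chr <- as.character(f)                # factor to character (the defac task)
as_num <- as.numeric(as.character(f))    # factor to numeric (the makenum task)

# Maximum that ignores missing values (the max_mis task).
max_no_na <- max(c(1, NA, 7), na.rm = TRUE)

# Strip a literal character from strings (the remove_char task).
no_star <- gsub("*", "", c("12*", "7", "30*"), fixed = TRUE)

# Strip thousands separators before converting (the decomma task).
no_comma <- as.numeric(gsub(",", "", c("1,200", "3,450,000")))

# Second-largest value in a vector (the nth_max task with n = 2).
second_max <- sort(c(3, 9, 4, 7), decreasing = TRUE)[2]
```

Note that calling as.numeric directly on a factor returns its internal integer codes, which is why the conversion goes through as.character first.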
eeptools includes ways to simplify the use of regression analysis tools recommended
by Gelman and Hill (2006) through the gelmansim function, which itself is a wrapper
for the arm::sim() function. This function generates the distribution of predicted
values automatically, which is useful for gauging uncertainty in a statistical
model and for comparing predictions from multiple models on the same case data to
see whether the values of those models overlap or are distinct from one another.
library(MASS)
# Examples of "sim"
set.seed(1)
J <- 15
n <- J*(J+1)/2
group <- rep(1:J, 1:J)
mu.a <- 5
sigma.a <- 2
a <- rnorm(J, mu.a, sigma.a)
b <- -3
x <- rnorm(n, 2, 1)
sigma.y <- 6
y <- rnorm(n, a[group] + b*x, sigma.y)
u <- runif(J, 0, 3)
dat <- cbind(y, x, group)
# Linear regression
dat <- as.data.frame(dat)
dat$group <- factor(dat$group)
M3 <- glm(y ~ x + group, data=dat)
cases <- expand.grid(x = seq(-2, 2, by=0.1), group=seq(1, 14, by=2))
cases$group <- factor(cases$group)
sim.results <- gelmansim(mod = M3, newdata = cases, n.sims=200, na.omit=TRUE)
head(sim.results)
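The general idea of simulating a distribution of predicted values can be sketched without the package. The example below draws coefficient vectors from a multivariate normal centered on the fitted coefficients and pushes each draw through the design matrix for a few new cases. It is a simplified approximation that ignores uncertainty in the residual variance, and it is not the algorithm used by arm::sim() or gelmansim; the model and the mtcars data are chosen only for illustration.

```r
set.seed(1)
fit <- lm(mpg ~ wt, data = mtcars)

# Draw 200 plausible coefficient vectors around the estimates.
beta_draws <- MASS::mvrnorm(200, mu = coef(fit), Sigma = vcov(fit))

# Two hypothetical cases to predict for.
newdata <- data.frame(wt = c(2.5, 3.5))
X <- model.matrix(~ wt, data = newdata)

# One column of predictions per simulated coefficient vector.
preds <- X %*% t(beta_draws)

# Summarize each case by the 2.5%, 50%, and 97.5% quantiles of its draws.
intervals <- t(apply(preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
intervals
```

Comparing such intervals across two models fitted to the same cases gives a quick visual check of whether their predictions overlap.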
There is also a ggplot2 version of plot.lm included:
data(mpg)
mymod <- lm(cty ~ displ + cyl + drv, data=mpg)
autoplot(mymod)
Finally, there is a convenient method for creating labeled mosaic plots.
sampDat <- data.frame(cbind(x=seq(1,3,by=1), y=sample(LETTERS[6:8], 60, replace=TRUE)),
                      fac=sample(LETTERS[1:4], 60, replace=TRUE))
varnames <- c('Quality','Grade')
crosstabplot(sampDat, "y", "fac", varnames = varnames, label = TRUE,
             title = "Crosstab Plot", shade = FALSE)
And without labels:
crosstabplot(sampDat, "y", "fac", varnames = varnames, label = FALSE, title = "Crosstab Plot", shade = TRUE)
eeptools provides three new datasets of interest to education researchers. These
datasets are also used in the R Bootcamp for Education Analysts.
library(eeptools)
data("stuatt")
head(stuatt)
The stuatt (student attributes) dataset comes from the
Strategic Data Project Toolkit for Effective Data Use.
This dataset is useful for learning how to clean data in R and how to aggregate
and summarize individual unit-record data into group-level data.
data(stulevel)
head(stulevel)
The stulevel dataset is a simulated student-level longitudinal record. It contains
student- and school-level attributes and is useful for practicing
longitudinal analyses of student unit-record data.
data("midsch")
head(midsch)
The midsch dataset contains an analysis of abnormalities in school-average
assessment scores. It contains observed and predicted values of aggregated
test scores at the school level for a large midwestern state.