knitr::opts_chunk$set(fig.width = 6, message = FALSE, warning = FALSE, comment = "", cache = F)
```{css, eval = TRUE, echo = FALSE} .remark-code, .remark-code-inline{font-size: 80%}, @media print { .has-continuation { display: block; } }
```r library(flipbookr) library(knitr) #not part of walkthrough
dplyr::mutate_at()
readRDS()
purrr::map and purrr::map_df()
purrr::set_names()
dplyr::slice()
This chapter explores what aggregate data is, and how to access, clean, and explore it.
Aggregate data refers to numerical information (or non-numerical information, such as the names of districts or schools) that has the following characteristics:
In this chapter, we’ll focus on educational equity by identifying and comparing patterns in student demographic groups.
--
library(tidyverse) # Create student-level data tibble( student = letters[1:10], school = rep(letters[11:15], 2), test_score = sample(0:100, 10, replace = TRUE) )
class: middle
Aggregate data totals up test_score to “hide” the student-level information. The rows of the resulting dataset represent the group level, here school.
tibble( student = letters[1:10], school = rep(letters[11:15], 2), test_score = sample(0:100, 10, replace = TRUE) ) %>% # Aggregate by school group_by(school) %>% summarize(mean_score = mean(test_score))
Common disaggregations for students include gender, race/ethnicity, socioeconomic status, English learner designation, and whether they are served under the Individuals with Disabilities Education Act (IDEA)
--
Disaggregated data is essential to monitor equity in educational resources and outcomes. If only aggregate data is provided, we are unable to distinguish how different groups of students are doing and what support they need. With disaggregated data, we can identify where solutions are needed to solve disparities in opportunity, resources, and treatment.
There are many publicly available aggregate datasets related to education.
For the purposes of this walkthrough, we will be looking at a particular school district’s data.
The district we focus on here reports their student demographics in a robust, complete way. Not only do they report the percentage of students in a subgroup, but they also include the number of students in each subgroup.
--
We will walk through how running analyses on data from a single district can help education data practitioners to understand and describe the landscape of needs and opportunities present.
We will use descriptive analysis on an aggregate dataset to find out whether there is a phenomenon present, what it is, and what may be worth trying to address through future supports, reforms, or interventions (Loeb et al., 2017).
library(tidyverse) library(here) library(janitor) library(dataedu) #library(tabulizer) #might cause issues on install
--
# Get data using {dataedu} race_pdf <- dataedu::race_pdf
race_df <- race_pdf %>% # Turn each page into a tibble map(~ as_tibble(.x, .name_repair = "unique")) %>% # Make data frame and remove unnecessary rows map_df(~ slice(.,-1:-2)) %>% # Use descriptive column names set_names( c( "school_group", "school_name", "grade", "na_num", # Native American number of students "na_pct", # Native American percentage of students "aa_num", # African American number of students "aa_pct", # African American percentage "as_num", # Asian number of students "as_pct", # Asian percentage "hi_num", # Hispanic number of students "hi_pct", # Hispanic percentage "wh_num", # White number of students "wh_pct", # White percentage "pi_pct", # Pacific Islander percentage "blank_col", "tot" # Total number of students (from the Race PDF) ) )
tail(race_df, 8)
race_df2 <- race_df %>% # Remove unnecessary columns select(-school_group, -grade, -pi_pct, -blank_col) %>% # Filter to get grade-level numbers filter(str_detect(school_name, "Total"), school_name != "Grand Total") %>% # Clean up school names mutate(school_name = str_replace(school_name, "Total", "")) %>% # Remove white space mutate_if(is.character, trimws) %>% # Turn percentage columns into numeric and decimal format mutate_at(vars(contains("pct")), list( ~ as.numeric(str_replace(., "%", "")) / 100))
# Get data using {dataedu} frpl_pdf <- dataedu::frpl_pdf
frpl_df <- frpl_pdf %>% # Turn each page into a tibble map(~ as_tibble(.x, .name_repair = "unique")) %>% # Make data frame and remove unnecessary rows map_df( ~ slice(.,-1)) %>% # Use descriptive column names set_names( c( "school_name", "not_eligible_num", # Number of non-eligible students, "reduce_num", # Number of students receiving reduced price lunch "free_num", # Number of students receiving free lunch "frpl_num", # Total number of students (from the FRPL PDF) "frpl_pct" # Free/reduced price lunch percentage ) )
frpl_df[42:47,]
frpl_df2 <- frpl_df %>% filter( # Remove blanks school_name != "", # Filter out the rows in this list !school_name %in% c( "ELM K_08", "Mid Schl", "High Schl", "Alt HS", "Spec Ed Total", "Cont Alt Total", "Hospital Sites Total", "Dist Total" ) ) %>% # Turn percentage columns into numeric and decimal format mutate(frpl_pct = as.numeric(str_replace(frpl_pct, "%", "")) / 100)
# create full dataset, joined by school name joined_df <- left_join(race_df2, frpl_df2, by = c("school_name")) %>% mutate_at(2:17, as.numeric)
Note: The total number of students from the Race/Ethnicity table does not match the total number of students from the FRPL table, even though they’re referring to the same districts in the same year. Why?
We want to calculate, for each race, the number of students in ‘high poverty’ schools. This is defined by NCES as schools that are over 75% FRPL (Education Statistics U.S. Department of Education, 2019). When a school is over 75% FRPL, we count the number of students for that particular race under the variable [racename]_povnum.
The {janitor} package has a handy adorn_totals() function that sums columns for you. This is important because we want a weighted average of students in each category, so we need the total number of students in each group.
We create the weighted average of the percentage of each race by dividing the number of students by race by the total number of students.
To get FRPL percentage for all schools, we have to recalculate frpl_pct (otherwise, it would not be a weighted average).
district_merged_df <- joined_df %>% # Calculate high poverty numbers mutate( hi_povnum = case_when(frpl_pct > .75 ~ hi_num), aa_povnum = case_when(frpl_pct > .75 ~ aa_num), wh_povnum = case_when(frpl_pct > .75 ~ wh_num), as_povnum = case_when(frpl_pct > .75 ~ as_num), na_povnum = case_when(frpl_pct > .75 ~ na_num) ) %>% adorn_totals() %>% # Create percentage by demographic mutate( na_pct = na_num / tot, aa_pct = aa_num / tot, as_pct = as_num / tot, hi_pct = hi_num / tot, wh_pct = wh_num / tot, frpl_pct = (free_num + reduce_num) / frpl_num, # Create percentage by demographic and poverty hi_povsch = hi_povnum / hi_num[which(school_name == "Total")], aa_povsch = aa_povnum / aa_num[which(school_name == "Total")], as_povsch = as_povnum / as_num[which(school_name == "Total")], wh_povsch = wh_povnum / wh_num[which(school_name == "Total")], na_povsch = na_povnum / na_num[which(school_name == "Total")] )
district_tidy_df <- district_merged_df %>% pivot_longer( cols = -matches("school_name"), names_to = "category", values_to = "value" )
r chunk_reveal("discover-dist", break_type = 1, widths = c(110, 85), title="###9.8.1 Discovering Distributions")
district_tidy_df %>% # Filter for Total rows, since we want district-level information filter(school_name == "Total", str_detect(category, "pct"), category != "frpl_pct") %>% ggplot(aes(x = reorder(category, -value), y = value)) + geom_bar(stat = "identity", aes(fill = category)) + labs(title = "Percentage of Population by Subgroup", x = "Subgroup", y = "Percentage of Population") + scale_x_discrete( labels = c( "aa_pct" = "Black", "wh_pct" = "White", "hi_pct" = "Hispanic", "as_pct" = "Asian", "na_pct" = "Native Am." ) ) + # Makes labels present as percentages scale_y_continuous(labels = scales::percent) + scale_fill_dataedu() + theme_dataedu() + theme(legend.position = "none")
district_tidy_df %>% filter(category == "frpl_pct", school_name == "Total")
r chunk_reveal("analyzing-spread", break_type = 1, widths = c(125, 92), title="###9.8.2 Analyzing Spread")
district_merged_df %>% # Remove district totals filter(school_name != "Total") %>% # X-axis: percentage of White students within schools ggplot(aes(x = wh_pct)) + geom_histogram(breaks = seq(0, 1, by = .1), fill = dataedu_colors("darkblue")) + labs(title = "Count of Schools by White Population", x = "White Percentage", y = "Count") + scale_x_continuous(labels = scales::percent) + theme(legend.position = "none") + theme_dataedu()
26 of the 74 (35%) of schools have between 0-10% White students. More than half of schools enroll fewer than 30% of White students even though White students make up 35% of the district student population.
class:middle
r chunk_reveal("poverty-subgroup", break_type = 1, widths = c(125, 100))
district_tidy_df %>% filter(school_name == "Total", str_detect(category, "povsch")) %>% ggplot(aes(x = reorder(category,-value), y = value)) + geom_bar(stat = "identity", aes(fill = factor(category))) + labs(title = "Subgroup Distribution in High Poverty Schools", x = "Subgroup", y = "Percentage in High Poverty Schools") + scale_x_discrete( labels = c( "aa_povsch" = "Black", "wh_povsch" = "White", "hi_povsch" = "Hispanic", "as_povsch" = "Asian", "na_povsch" = "Native Am." ) ) + scale_y_continuous(labels = scales::percent) + scale_fill_dataedu() + theme_dataedu() + theme(legend.position = "none")
8% of White students attend high poverty schools, compared to 43% of Black students, 39% of Hispanic students, 28% of Asian students, and 45% of Native American students. Non-White students are disproportionally attending high poverty schools.
r chunk_reveal("race-frpl-plot", break_type = 1, widths = c(100, 100), title="###9.9.2 Reveal Relationships")
district_merged_df %>% filter(school_name != "Total") %>% ggplot(aes(x = wh_pct, y = frpl_pct)) + geom_point(color = dataedu_colors("green")) + labs(title = "FRPL Percentage vs. White Percentage", x = "White Percentage", y = "FRPL Percentage") + scale_y_continuous(labels = scales::percent) + scale_x_continuous(labels = scales::percent) + theme_dataedu() + theme(legend.position = "none")
There exists a distribution of race/ethnicity within schools that are not representative of the district.
Students of color are overrepresented in high poverty schools.
There is a negative relationship between the percentage of White students in a school and the percentage of students eligible for FRPL.
--
Disaggregating aggregate data can allow us to showcase the inequity in a system and suggest interventions for what we can do to improve the situation in the district.
Research shows that racial and socioeconomic diversity in schools can provide students with a range of cognitive and social benefits. Therefore, the deep segregation that exists in the district can have adverse effects on students. Furthermore, high-poverty schools may lack other educational resources that are available in low-poverty schools.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.