knitr::opts_chunk$set(fig.width = 6, message = FALSE, warning = FALSE, comment = "", cache = F)

```{css, eval = TRUE, echo = FALSE} .remark-code, .remark-code-inline{font-size: 80%}, @media print { .has-continuation { display: block; } }

```r
library(flipbookr)
library(knitr) #not part of walkthrough

9.1 Topics Emphasized

9.2 Functions Introduced


9.3 Vocabulary


9.4 Overview

This chapter explores what aggregate data is, and how to access, clean, and explore it.

9.4.1 Background

Aggregate data refers to numerical information (or non-numerical information, such as the names of districts or schools) that has the following characteristics:

  1. collected from multiple sources and/or on multiple measures, variables, or individuals
  2. compiled into data summaries or summary reports, typically for public reporting or statistical analysis

In this chapter, we’ll focus on educational equity by identifying and comparing patterns in student demographic groups.


What is the difference between aggregate and student-level data?

--

library(tidyverse)

# Create student-level data 
tibble(
  student = letters[1:10],
  school = rep(letters[11:15], 2),
  test_score = sample(0:100, 10, replace = TRUE)
)

class: middle

Aggregation

Aggregate data totals up test_score to “hide” the student-level information. The rows of the resulting dataset represent the group level, here school.


tibble(
  student = letters[1:10],
  school = rep(letters[11:15], 2),
  test_score = sample(0:100, 10, replace = TRUE)
) %>%
  # Aggregate by school
  group_by(school) %>%
  summarize(mean_score = mean(test_score))

Disaggregating Aggregated Data

Common disaggregations for students include gender, race/ethnicity, socioeconomic status, English learner designation, and whether they are served under the Individuals with Disabilities Education Act (IDEA)

--

Disaggregating Data and Equity

Disaggregated data is essential to monitor equity in educational resources and outcomes. If only aggregate data is provided, we are unable to distinguish how different groups of students are doing and what support they need. With disaggregated data, we can identify where solutions are needed to solve disparities in opportunity, resources, and treatment.


Data Sources

There are many publicly available aggregate datasets related to education.

International level:

Federal:

State and district:


Selecting Data

For the purposes of this walkthrough, we will be looking at a particular school district’s data.

The district we focus on here reports their student demographics in a robust, complete way. Not only do they report the percentage of students in a subgroup, but they also include the number of students in each subgroup.

--

9.5.1 Methods

We will walk through how running analyses on data from a single district can help education data practitioners to understand and describe the landscape of needs and opportunities present.

We will use descriptive analysis on an aggregate dataset to find out whether there is a phenomenon present, what it is, and what may be worth trying to address through future supports, reforms, or interventions (Loeb et al., 2017).


9.6 Load Packages

library(tidyverse)
library(here)
library(janitor)
library(dataedu)
#library(tabulizer) #might cause issues on install

--

9.7 Import Data

  1. We can use {tabulizer}, which pulls the PDF data into lists using extract_tables().
  2. We can get the data from the book’s Github repository
  3. We can get the data from the {dataedu} package.

# Get data using {dataedu}
race_pdf <-
  dataedu::race_pdf

race_df <-
  race_pdf %>%
  # Turn each page into a tibble
  map(~ as_tibble(.x, .name_repair = "unique")) %>% 
  # Make data frame and remove unnecessary rows
  map_df(~ slice(.,-1:-2)) %>%
  # Use descriptive column names
  set_names(
    c(
      "school_group",
      "school_name",
      "grade",
      "na_num", # Native American number of students
      "na_pct", # Native American percentage of students
      "aa_num", # African American number of students
      "aa_pct", # African American percentage
      "as_num", # Asian number of students
      "as_pct", # Asian percentage
      "hi_num", # Hispanic number of students
      "hi_pct", # Hispanic percentage
      "wh_num", # White number of students
      "wh_pct", # White percentage
      "pi_pct", # Pacific Islander percentage
      "blank_col",
      "tot" # Total number of students (from the Race PDF)
    )
  )

Data frame exploration:

tail(race_df, 8)
race_df2 <-
  race_df %>%
  # Remove unnecessary columns
  select(-school_group, -grade, -pi_pct, -blank_col) %>%
  # Filter to get grade-level numbers
  filter(str_detect(school_name, "Total"),
         school_name != "Grand Total") %>%
  # Clean up school names
  mutate(school_name = str_replace(school_name, "Total", "")) %>%
  # Remove white space
  mutate_if(is.character, trimws) %>%
  # Turn percentage columns into numeric and decimal format
  mutate_at(vars(contains("pct")), list( ~ as.numeric(str_replace(., "%", "")) / 100))

Import Free Reduced Price Lunch (FRPL) PDFs

# Get data using {dataedu}
frpl_pdf <-
  dataedu::frpl_pdf
frpl_df <-
  frpl_pdf %>%
  # Turn each page into a tibble
  map(~ as_tibble(.x, .name_repair = "unique")) %>% 
  # Make data frame and remove unnecessary rows
  map_df( ~ slice(.,-1)) %>%
  # Use descriptive column names
  set_names(
    c(
      "school_name",
      "not_eligible_num", # Number of non-eligible students,
      "reduce_num", # Number of students receiving reduced price lunch
      "free_num",   # Number of students receiving free lunch
      "frpl_num",  # Total number of students (from the FRPL PDF)
      "frpl_pct" # Free/reduced price lunch percentage
    )
  )

frpl_df[42:47,]
frpl_df2 <-
  frpl_df %>%
  filter(
    # Remove blanks
    school_name != "",
    # Filter out the rows in this list
    !school_name %in% c(
      "ELM K_08",
      "Mid Schl",
      "High Schl",
      "Alt HS",
      "Spec Ed Total",
      "Cont Alt Total",
      "Hospital Sites Total",
      "Dist Total"
    )
  ) %>%
  # Turn percentage columns into numeric and decimal format
  mutate(frpl_pct = as.numeric(str_replace(frpl_pct, "%", "")) / 100)

Join datasets

# create full dataset, joined by school name
joined_df <-
  left_join(race_df2, frpl_df2, by = c("school_name")) %>%
  mutate_at(2:17, as.numeric)

Note: The total number of students from the Race/Ethnicity table does not match the total number of students from the FRPL table, even though they’re referring to the same districts in the same year. Why?


Next steps

  1. We want to calculate, for each race, the number of students in ‘high poverty’ schools. This is defined by NCES as schools that are over 75% FRPL (Education Statistics U.S. Department of Education, 2019). When a school is over 75% FRPL, we count the number of students for that particular race under the variable [racename]_povnum.

  2. The {janitor} package has a handy adorn_totals() function that sums columns for you. This is important because we want a weighted average of students in each category, so we need the total number of students in each group.

  3. We create the weighted average of the percentage of each race by dividing the number of students by race by the total number of students.

  4. To get FRPL percentage for all schools, we have to recalculate frpl_pct (otherwise, it would not be a weighted average).

5. To calculate the percentage of students by race who are in high poverty schools, we must divide the number of students in high poverty schools by the total number of students in that race.

district_merged_df <-
  joined_df %>%
  # Calculate high poverty numbers
  mutate(
    hi_povnum = case_when(frpl_pct > .75 ~ hi_num),
    aa_povnum = case_when(frpl_pct > .75 ~ aa_num),
    wh_povnum = case_when(frpl_pct > .75 ~ wh_num),
    as_povnum = case_when(frpl_pct > .75 ~ as_num),
    na_povnum = case_when(frpl_pct > .75 ~ na_num)
  ) %>%
  adorn_totals() %>%
  # Create percentage by demographic
  mutate(
    na_pct = na_num / tot,
    aa_pct = aa_num / tot,
    as_pct = as_num / tot,
    hi_pct = hi_num / tot,
    wh_pct = wh_num / tot,
    frpl_pct = (free_num + reduce_num) / frpl_num,
    # Create percentage by demographic and poverty
    hi_povsch = hi_povnum / hi_num[which(school_name == "Total")],
    aa_povsch = aa_povnum / aa_num[which(school_name == "Total")],
    as_povsch = as_povnum / as_num[which(school_name == "Total")],
    wh_povsch = wh_povnum / wh_num[which(school_name == "Total")],
    na_povsch = na_povnum / na_num[which(school_name == "Total")]
  )

Tidying

district_tidy_df <-
  district_merged_df %>%
  pivot_longer(
    cols = -matches("school_name"),
    names_to = "category",
    values_to = "value"
  )

r chunk_reveal("discover-dist", break_type = 1, widths = c(110, 85), title="###9.8.1 Discovering Distributions")

district_tidy_df %>%
  # Filter for Total rows, since we want district-level information
  filter(school_name == "Total",
         str_detect(category, "pct"),
         category != "frpl_pct") %>%
  ggplot(aes(x = reorder(category, -value), y = value)) +
  geom_bar(stat = "identity", aes(fill = category)) +
  labs(title = "Percentage of Population by Subgroup",
       x = "Subgroup",
       y = "Percentage of Population") +
  scale_x_discrete(
    labels = c(
      "aa_pct" = "Black",
      "wh_pct" = "White",
      "hi_pct" = "Hispanic",
      "as_pct" = "Asian",
      "na_pct" = "Native Am."
    )
  ) + # Makes labels present as percentages
  scale_y_continuous(labels = scales::percent) + 
  scale_fill_dataedu() +
  theme_dataedu() +
  theme(legend.position = "none")

Almost 40% of students are Black and around 36% are White.

Eligibility for FRPL

district_tidy_df %>%
  filter(category == "frpl_pct",
         school_name == "Total")

We see 56.9% of the students are eligible for FRPL compared to the U.S. average of 52.1%. This matches the PDFs.

r chunk_reveal("analyzing-spread", break_type = 1, widths = c(125, 92), title="###9.8.2 Analyzing Spread")

district_merged_df %>%
  # Remove district totals
  filter(school_name != "Total") %>%
  # X-axis: percentage of White students within schools
  ggplot(aes(x = wh_pct)) +
  geom_histogram(breaks = seq(0, 1, by = .1),
                 fill = dataedu_colors("darkblue"))  +
  labs(title = "Count of Schools by White Population",
       x = "White Percentage",
       y = "Count") +
  scale_x_continuous(labels = scales::percent) + 
  theme(legend.position = "none") +
  theme_dataedu()

26 of the 74 (35%) of schools have between 0-10% White students. More than half of schools enroll fewer than 30% of White students even though White students make up 35% of the district student population.


class:middle

The school race demographics are not representative of the district populations but does that hold for socioeconomic status as well?


9.9 Analysis

9.9.1 Creating Categories

High-poverty schools are defined as public schools where more than 75% of the students are eligible for FRPL. According to NCES, 24% of public school students attended high-poverty schools. However, different subgroups are overrepresented and underrepresented within the high poverty schools. Is this the case for this district?

r chunk_reveal("poverty-subgroup", break_type = 1, widths = c(125, 100))

district_tidy_df %>%
  filter(school_name == "Total",
         str_detect(category, "povsch")) %>%
  ggplot(aes(x = reorder(category,-value), y = value)) +
  geom_bar(stat = "identity", aes(fill = factor(category))) +
  labs(title = "Subgroup Distribution in High Poverty Schools",
       x = "Subgroup",
       y = "Percentage in High Poverty Schools") +
  scale_x_discrete(
    labels = c(
      "aa_povsch" = "Black",
      "wh_povsch" = "White",
      "hi_povsch" = "Hispanic",
      "as_povsch" = "Asian",
      "na_povsch" = "Native Am."
    )
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_dataedu() +
  theme_dataedu() +
  theme(legend.position = "none")

8% of White students attend high poverty schools, compared to 43% of Black students, 39% of Hispanic students, 28% of Asian students, and 45% of Native American students. Non-White students are disproportionally attending high poverty schools.


r chunk_reveal("race-frpl-plot", break_type = 1, widths = c(100, 100), title="###9.9.2 Reveal Relationships")

district_merged_df %>%
  filter(school_name != "Total") %>%
  ggplot(aes(x = wh_pct, y = frpl_pct)) +
  geom_point(color = dataedu_colors("green")) +
  labs(title = "FRPL Percentage vs. White Percentage",
       x = "White Percentage",
       y = "FRPL Percentage") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  theme_dataedu() +
  theme(legend.position = "none")

9.10 Results

  1. There exists a distribution of race/ethnicity within schools that are not representative of the district.

  2. Students of color are overrepresented in high poverty schools.

  3. There is a negative relationship between the percentage of White students in a school and the percentage of students eligible for FRPL.

--

9.11 Conclusion

Disaggregating aggregate data can allow us to showcase the inequity in a system and suggest interventions for what we can do to improve the situation in the district.

Research shows that racial and socioeconomic diversity in schools can provide students with a range of cognitive and social benefits. Therefore, the deep segregation that exists in the district can have adverse effects on students. Furthermore, high-poverty schools may lack other educational resources that are available in low-poverty schools.



r4ds/bookclub-dsieur documentation built on May 20, 2022, 6:24 p.m.