In r4ds/bookclub-dsieur: Data Science in Education Using R Book Club

knitr::opts_chunk$set(fig.width = 6, message = FALSE, warning = FALSE, comment = "", cache = F)

```{css, eval = TRUE, echo = FALSE} .remark-code, .remark-code-inline{font-size: 80%}, @media print { .has-continuation { display: block; } }

```r
library(flipbookr)
library(knitr) #not part of walkthrough

9.1 Topics Emphasized

Importing data
Tidying data
Transforming data
Visualizing data

9.2 Functions Introduced

dplyr::mutate_at()
readRDS()
purrr::map and purrr::map_df()
purrr::set_names()
dplyr::slice()

9.3 Vocabulary

aggregate data
disaggregated data
data frame
Free/Reduced Price Lunch (FRPL)
histogram
lists
subgroup
trim
weighted average

9.4 Overview

This chapter explores what aggregate data is, and how to access, clean, and explore it.

9.4.1 Background

Aggregate data refers to numerical information (or non-numerical information, such as the names of districts or schools) that has the following characteristics:

collected from multiple sources and/or on multiple measures, variables, or individuals
compiled into data summaries or summary reports, typically for public reporting or statistical analysis

In this chapter, we’ll focus on educational equity by identifying and comparing patterns in student demographic groups.

What is the difference between aggregate and student-level data?

library(tidyverse)

# Create student-level data 
tibble(
  student = letters[1:10],
  school = rep(letters[11:15], 2),
  test_score = sample(0:100, 10, replace = TRUE)
)

class: middle

Aggregation

Aggregate data totals up test_score to “hide” the student-level information. The rows of the resulting dataset represent the group level, here school.

tibble(
  student = letters[1:10],
  school = rep(letters[11:15], 2),
  test_score = sample(0:100, 10, replace = TRUE)
) %>%
  # Aggregate by school
  group_by(school) %>%
  summarize(mean_score = mean(test_score))

Disaggregating Aggregated Data

Common disaggregations for students include gender, race/ethnicity, socioeconomic status, English learner designation, and whether they are served under the Individuals with Disabilities Education Act (IDEA)

Disaggregating Data and Equity

Disaggregated data is essential to monitor equity in educational resources and outcomes. If only aggregate data is provided, we are unable to distinguish how different groups of students are doing and what support they need. With disaggregated data, we can identify where solutions are needed to solve disparities in opportunity, resources, and treatment.

Data Sources

There are many publicly available aggregate datasets related to education.

International level:

Programme for International Student Assessment (PISA)

Federal:

Civil Rights Data Collection (CRDC)
Common Core of Data (CCD)
EdFacts
Integrated Postsecondary Education Data System (IPEDS)
National Assessment for Educational Progress (NAEP)

State and district:

California Department of Education
Minneapolis Public Schools

Selecting Data

For the purposes of this walkthrough, we will be looking at a particular school district’s data.

The district we focus on here reports their student demographics in a robust, complete way. Not only do they report the percentage of students in a subgroup, but they also include the number of students in each subgroup.

9.5.1 Methods

We will walk through how running analyses on data from a single district can help education data practitioners to understand and describe the landscape of needs and opportunities present.

We will use descriptive analysis on an aggregate dataset to find out whether there is a phenomenon present, what it is, and what may be worth trying to address through future supports, reforms, or interventions (Loeb et al., 2017).

9.6 Load Packages

library(tidyverse)
library(here)
library(janitor)
library(dataedu)
#library(tabulizer) #might cause issues on install

9.7 Import Data

We can use {tabulizer}, which pulls the PDF data into lists using extract_tables().
We can get the data from the book’s Github repository
We can get the data from the {dataedu} package.

# Get data using {dataedu}
race_pdf <-
  dataedu::race_pdf

race_df <-
  race_pdf %>%
  # Turn each page into a tibble
  map(~ as_tibble(.x, .name_repair = "unique")) %>% 
  # Make data frame and remove unnecessary rows
  map_df(~ slice(.,-1:-2)) %>%
  # Use descriptive column names
  set_names(
    c(
      "school_group",
      "school_name",
      "grade",
      "na_num", # Native American number of students
      "na_pct", # Native American percentage of students
      "aa_num", # African American number of students
      "aa_pct", # African American percentage
      "as_num", # Asian number of students
      "as_pct", # Asian percentage
      "hi_num", # Hispanic number of students
      "hi_pct", # Hispanic percentage
      "wh_num", # White number of students
      "wh_pct", # White percentage
      "pi_pct", # Pacific Islander percentage
      "blank_col",
      "tot" # Total number of students (from the Race PDF)
    )
  )

Data frame exploration:

tail(race_df, 8)

race_df2 <-
  race_df %>%
  # Remove unnecessary columns
  select(-school_group, -grade, -pi_pct, -blank_col) %>%
  # Filter to get grade-level numbers
  filter(str_detect(school_name, "Total"),
         school_name != "Grand Total") %>%
  # Clean up school names
  mutate(school_name = str_replace(school_name, "Total", "")) %>%
  # Remove white space
  mutate_if(is.character, trimws) %>%
  # Turn percentage columns into numeric and decimal format
  mutate_at(vars(contains("pct")), list( ~ as.numeric(str_replace(., "%", "")) / 100))

Import Free Reduced Price Lunch (FRPL) PDFs

# Get data using {dataedu}
frpl_pdf <-
  dataedu::frpl_pdf

frpl_df <-
  frpl_pdf %>%
  # Turn each page into a tibble
  map(~ as_tibble(.x, .name_repair = "unique")) %>% 
  # Make data frame and remove unnecessary rows
  map_df( ~ slice(.,-1)) %>%
  # Use descriptive column names
  set_names(
    c(
      "school_name",
      "not_eligible_num", # Number of non-eligible students,
      "reduce_num", # Number of students receiving reduced price lunch
      "free_num",   # Number of students receiving free lunch
      "frpl_num",  # Total number of students (from the FRPL PDF)
      "frpl_pct" # Free/reduced price lunch percentage
    )
  )

frpl_df[42:47,]

frpl_df2 <-
  frpl_df %>%
  filter(
    # Remove blanks
    school_name != "",
    # Filter out the rows in this list
    !school_name %in% c(
      "ELM K_08",
      "Mid Schl",
      "High Schl",
      "Alt HS",
      "Spec Ed Total",
      "Cont Alt Total",
      "Hospital Sites Total",
      "Dist Total"
    )
  ) %>%
  # Turn percentage columns into numeric and decimal format
  mutate(frpl_pct = as.numeric(str_replace(frpl_pct, "%", "")) / 100)

Join datasets

# create full dataset, joined by school name
joined_df <-
  left_join(race_df2, frpl_df2, by = c("school_name")) %>%
  mutate_at(2:17, as.numeric)

Note: The total number of students from the Race/Ethnicity table does not match the total number of students from the FRPL table, even though they’re referring to the same districts in the same year. Why?

Next steps

We want to calculate, for each race, the number of students in ‘high poverty’ schools. This is defined by NCES as schools that are over 75% FRPL (Education Statistics U.S. Department of Education, 2019). When a school is over 75% FRPL, we count the number of students for that particular race under the variable [racename]_povnum.
The {janitor} package has a handy adorn_totals() function that sums columns for you. This is important because we want a weighted average of students in each category, so we need the total number of students in each group.
We create the weighted average of the percentage of each race by dividing the number of students by race by the total number of students.
To get FRPL percentage for all schools, we have to recalculate frpl_pct (otherwise, it would not be a weighted average).

5. To calculate the percentage of students by race who are in high poverty schools, we must divide the number of students in high poverty schools by the total number of students in that race.

district_merged_df <-
  joined_df %>%
  # Calculate high poverty numbers
  mutate(
    hi_povnum = case_when(frpl_pct > .75 ~ hi_num),
    aa_povnum = case_when(frpl_pct > .75 ~ aa_num),
    wh_povnum = case_when(frpl_pct > .75 ~ wh_num),
    as_povnum = case_when(frpl_pct > .75 ~ as_num),
    na_povnum = case_when(frpl_pct > .75 ~ na_num)
  ) %>%
  adorn_totals() %>%
  # Create percentage by demographic
  mutate(
    na_pct = na_num / tot,
    aa_pct = aa_num / tot,
    as_pct = as_num / tot,
    hi_pct = hi_num / tot,
    wh_pct = wh_num / tot,
    frpl_pct = (free_num + reduce_num) / frpl_num,
    # Create percentage by demographic and poverty
    hi_povsch = hi_povnum / hi_num[which(school_name == "Total")],
    aa_povsch = aa_povnum / aa_num[which(school_name == "Total")],
    as_povsch = as_povnum / as_num[which(school_name == "Total")],
    wh_povsch = wh_povnum / wh_num[which(school_name == "Total")],
    na_povsch = na_povnum / na_num[which(school_name == "Total")]
  )

Tidying

district_tidy_df <-
  district_merged_df %>%
  pivot_longer(
    cols = -matches("school_name"),
    names_to = "category",
    values_to = "value"
  )

r chunk_reveal("discover-dist", break_type = 1, widths = c(110, 85), title="###9.8.1 Discovering Distributions")

district_tidy_df %>%
  # Filter for Total rows, since we want district-level information
  filter(school_name == "Total",
         str_detect(category, "pct"),
         category != "frpl_pct") %>%
  ggplot(aes(x = reorder(category, -value), y = value)) +
  geom_bar(stat = "identity", aes(fill = category)) +
  labs(title = "Percentage of Population by Subgroup",
       x = "Subgroup",
       y = "Percentage of Population") +
  scale_x_discrete(
    labels = c(
      "aa_pct" = "Black",
      "wh_pct" = "White",
      "hi_pct" = "Hispanic",
      "as_pct" = "Asian",
      "na_pct" = "Native Am."
    )
  ) + # Makes labels present as percentages
  scale_y_continuous(labels = scales::percent) + 
  scale_fill_dataedu() +
  theme_dataedu() +
  theme(legend.position = "none")

Almost 40% of students are Black and around 36% are White.

Eligibility for FRPL

district_tidy_df %>%
  filter(category == "frpl_pct",
         school_name == "Total")

We see 56.9% of the students are eligible for FRPL compared to the U.S. average of 52.1%. This matches the PDFs.

r chunk_reveal("analyzing-spread", break_type = 1, widths = c(125, 92), title="###9.8.2 Analyzing Spread")

district_merged_df %>%
  # Remove district totals
  filter(school_name != "Total") %>%
  # X-axis: percentage of White students within schools
  ggplot(aes(x = wh_pct)) +
  geom_histogram(breaks = seq(0, 1, by = .1),
                 fill = dataedu_colors("darkblue"))  +
  labs(title = "Count of Schools by White Population",
       x = "White Percentage",
       y = "Count") +
  scale_x_continuous(labels = scales::percent) + 
  theme(legend.position = "none") +
  theme_dataedu()

26 of the 74 (35%) of schools have between 0-10% White students. More than half of schools enroll fewer than 30% of White students even though White students make up 35% of the district student population.

class:middle

The school race demographics are not representative of the district populations but does that hold for socioeconomic status as well?

9.9 Analysis

9.9.1 Creating Categories

High-poverty schools are defined as public schools where more than 75% of the students are eligible for FRPL. According to NCES, 24% of public school students attended high-poverty schools. However, different subgroups are overrepresented and underrepresented within the high poverty schools. Is this the case for this district?

r chunk_reveal("poverty-subgroup", break_type = 1, widths = c(125, 100))

district_tidy_df %>%
  filter(school_name == "Total",
         str_detect(category, "povsch")) %>%
  ggplot(aes(x = reorder(category,-value), y = value)) +
  geom_bar(stat = "identity", aes(fill = factor(category))) +
  labs(title = "Subgroup Distribution in High Poverty Schools",
       x = "Subgroup",
       y = "Percentage in High Poverty Schools") +
  scale_x_discrete(
    labels = c(
      "aa_povsch" = "Black",
      "wh_povsch" = "White",
      "hi_povsch" = "Hispanic",
      "as_povsch" = "Asian",
      "na_povsch" = "Native Am."
    )
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_dataedu() +
  theme_dataedu() +
  theme(legend.position = "none")

8% of White students attend high poverty schools, compared to 43% of Black students, 39% of Hispanic students, 28% of Asian students, and 45% of Native American students. Non-White students are disproportionally attending high poverty schools.

r chunk_reveal("race-frpl-plot", break_type = 1, widths = c(100, 100), title="###9.9.2 Reveal Relationships")

district_merged_df %>%
  filter(school_name != "Total") %>%
  ggplot(aes(x = wh_pct, y = frpl_pct)) +
  geom_point(color = dataedu_colors("green")) +
  labs(title = "FRPL Percentage vs. White Percentage",
       x = "White Percentage",
       y = "FRPL Percentage") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  theme_dataedu() +
  theme(legend.position = "none")

9.10 Results

There exists a distribution of race/ethnicity within schools that are not representative of the district.
Students of color are overrepresented in high poverty schools.
There is a negative relationship between the percentage of White students in a school and the percentage of students eligible for FRPL.

9.11 Conclusion

Disaggregating aggregate data can allow us to showcase the inequity in a system and suggest interventions for what we can do to improve the situation in the district.

Research shows that racial and socioeconomic diversity in schools can provide students with a range of cognitive and social benefits. Therefore, the deep segregation that exists in the district can have adverse effects on students. Furthermore, high-poverty schools may lack other educational resources that are available in low-poverty schools.

r4ds/bookclub-dsieur documentation built on May 20, 2022, 6:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

r4ds/bookclub-dsieur
Data Science in Education Using R Book Club

In r4ds/bookclub-dsieur: Data Science in Education Using R Book Club

9.1 Topics Emphasized

9.2 Functions Introduced

9.3 Vocabulary

9.4 Overview

9.4.1 Background

What is the difference between aggregate and student-level data?

Aggregation

Disaggregating Aggregated Data

Disaggregating Data and Equity

Data Sources

International level:

Federal:

State and district:

Selecting Data

9.5.1 Methods

9.6 Load Packages

9.7 Import Data

We can get the data from the {dataedu} package.

Data frame exploration:

Import Free Reduced Price Lunch (FRPL) PDFs

Join datasets

Next steps

5. To calculate the percentage of students by race who are in high poverty schools, we must divide the number of students in high poverty schools by the total number of students in that race.

Tidying

Almost 40% of students are Black and around 36% are White.

Eligibility for FRPL

We see 56.9% of the students are eligible for FRPL compared to the U.S. average of 52.1%. This matches the PDFs.

The school race demographics are not representative of the district populations but does that hold for socioeconomic status as well?

9.9 Analysis

9.9.1 Creating Categories

9.10 Results

9.11 Conclusion

R Package Documentation

Browse R Packages

We want your feedback!

r4ds/bookclub-dsieur Data Science in Education Using R Book Club

In r4ds/bookclub-dsieur: Data Science in Education Using R Book Club

9.1 Topics Emphasized

9.2 Functions Introduced

9.3 Vocabulary

9.4 Overview

9.4.1 Background

What is the difference between aggregate and student-level data?

Aggregation

Disaggregating Aggregated Data

Disaggregating Data and Equity

Data Sources

International level:

Federal:

State and district:

Selecting Data

9.5.1 Methods

9.6 Load Packages

9.7 Import Data

We can get the data from the {dataedu} package.

Data frame exploration:

Import Free Reduced Price Lunch (FRPL) PDFs

Join datasets

Next steps

5. To calculate the percentage of students by race who are in high poverty schools, we must divide the number of students in high poverty schools by the total number of students in that race.

Tidying

Almost 40% of students are Black and around 36% are White.

Eligibility for FRPL

We see 56.9% of the students are eligible for FRPL compared to the U.S. average of 52.1%. This matches the PDFs.

The school race demographics are not representative of the district populations but does that hold for socioeconomic status as well?

9.9 Analysis

9.9.1 Creating Categories

9.10 Results

9.11 Conclusion

R Package Documentation

Browse R Packages

We want your feedback!

r4ds/bookclub-dsieur
Data Science in Education Using R Book Club