library(WilliamsStaff)
library(tm)
library(ggplot2)
# Sys.setlocale('LC_ALL','C')

Abstract

The WilliamsStaff package provides tools for exploring characteristics of the Williams faculty over time. This article describes how to use the package and the files it includes, and walks through sample usages.

Raw Data

Information about the staff of Williams can be found online in archived course catalogs. Only professors are investigated here, though data for other faculty at Williams is available.

The source data is in PDF form, and the tm package is used to extract text from it. Of note, the pdftotext command from XPDF is the only non-R dependency for this package; it is a shell command that performs the conversion.
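Since pdftotext lives outside of R, it can be worth confirming that it is installed before attempting an extraction. The following is a minimal sketch using only base R; the variable names are illustrative and not part of the WilliamsStaff package.

```r
# Sys.which() returns an empty string when a command is not on the PATH.
pdftotext_path <- Sys.which("pdftotext")
has_pdftotext  <- unname(nzchar(pdftotext_path))

if (!has_pdftotext) {
  message("pdftotext was not found; install XPDF before extracting text from the PDFs.")
}
```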

Simple Extraction

We call find_data_by_year to scan a given year's PDF for relevant content. For the sake of computing efficiency, the full-length PDFs have been trimmed down to a more manageable size (on average, from around 400 pages to fewer than 20).

year = 2012 # any value %in% 2000:2013  will work
data = find_data_by_year(year)
head(data)

After the content is read into R, quality-of-life modifications are made: page numbers, blank pages, and headers are removed. What remains are lines of professor information. Of note, information about a particular professor might be spread over multiple lines.

#information for Professor Ashraf is all on the same line and nowhere else
data[10]
#the information for Professor Benson is found on two adjacent lines
data[26]
data[27]
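To make the cleanup step concrete, here is a minimal sketch of how page numbers, blank lines, and repeated headers could be filtered out of extracted text. The function clean_lines, the header wording "FACULTY", and the sample lines are all hypothetical; the package's own cleanup may work differently.

```r
# Hypothetical raw lines, standing in for text extracted from a catalog PDF.
raw_lines <- c(
  "FACULTY",                            # repeated page header (assumed wording)
  "Jane Q. Doe, Professor of Physics",
  "",                                   # blank line
  "  B.S. (1990) Example College",
  "347",                                # bare page number
  "FACULTY",
  "John Roe, Lecturer in Music"
)

clean_lines <- function(lines, header = "FACULTY") {
  lines <- trimws(lines)
  keep <- nzchar(lines) &               # drop blank lines
    !grepl("^[0-9]+$", lines) &         # drop lines that are only a page number
    lines != header                     # drop the repeated header
  lines[keep]
}

cleaned <- clean_lines(raw_lines)
cleaned
```

Note that the second cleaned line carries information for the professor named on the first, which is the multi-line situation described above.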

Combined Extraction

While being able to read any data from the PDFs is useful, it is more important that professors (and their relevant information) can be readily identified. The collect_faculty function weaves together the raw output of find_data_by_year into a list of faculty for a given year, with each element containing all of the information for one professor. The optional boolean parameter append_year defaults to FALSE, in which case the output resembles the data in the raw PDFs. If it is set to TRUE, the function appends "|||YEAR" to the end of each faculty entry, where YEAR is the year from which the data was collected. This prevents the year information from being lost when professors from different years are compared against each other.

faculty2013 = collect_faculty(2013, append_year = TRUE)
head(faculty2013)
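Once the year has been appended, it can be recovered from an entry by splitting on the "|||" separator. The sketch below uses a made-up entry in the shape described above; the professor and the variable names are hypothetical.

```r
# Hypothetical entry, shaped like the append_year = TRUE output described above.
entry <- "Jane Q. Doe, Professor of Physics, B.S. (1990) Example College|||2013"

# fixed = TRUE treats "|||" literally; "|" is otherwise the regex alternation operator.
parts <- strsplit(entry, "|||", fixed = TRUE)[[1]]
info  <- parts[1]              # everything before the separator
year  <- as.integer(parts[2])  # the appended catalog year
```

Choosing a separator that is unlikely to occur in a name or title keeps this split unambiguous.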

There is a concern about how well collect_faculty scales, since the function is at the mercy of the formatting chosen for the faculty information within each PDF. For the 14 years for which data is available, all entries begin with the professor's name and title, in that order. Luckily, for 2000-2012 the ordering of the remaining information is relatively static; the 2013 catalog, however, orders its data differently. To keep the rest of the package flexible, a standard order of information was adopted, and the helper function general_format provides that consistent template of professor information. That order is Name, Position, (degree, year, college), (degree, year, college), … .

A quick note on formatting

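To select just the text from a faculty list, use [[ ]] notation, which returns the entry itself as a character string. Single brackets ([ ]) return a one-element list instead; functions that coerce their input, such as nchar, still give the same result, as seen below.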

faculty2013[1]
faculty2013[[1]]
nchar(faculty2013[1])
nchar(faculty2013[[1]])
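The difference between the two notations is easier to see on a small stand-in list. The list toy below is hypothetical and only mimics the shape of a faculty list.

```r
# A toy list with the same shape as a faculty list (one string per element).
toy <- list("Jane Q. Doe, Professor of Physics", "John Roe, Lecturer in Music")

class(toy[1])     # a one-element list
class(toy[[1]])   # the character string itself

# nchar() coerces its argument to character, so both forms agree here.
nchar(toy[1]) == nchar(toy[[1]])

# strsplit() insists on a character vector, so only the [[ ]] form is accepted.
first_fields <- strsplit(toy[[1]], ", ", fixed = TRUE)[[1]]
```

Using [[ ]] consistently avoids having to remember which functions coerce a list and which do not.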

WilliamsStaff package limits

In terms of recognizing professor position titles and types of degrees, this package employs a white-listing system. Thus, if there is a typo in the original PDF, that can lead to malformed data. For example, consider

faculty2013[[70]]

Note that data from Professor Curulla and Artist-in-Residence Dankmeyer are treated as a single entry. While stitching together a professor from potentially multiple lines, the collector scans for professor keywords such as "Professor", "Fellow", or "Lecturer". Whenever a line contains a keyword, a new entry is started.

Thus, data for the unrecognized faculty member will be combined with that of the immediately preceding faculty member. This is potentially problematic in two ways: 1) the unrecognized faculty member will not be counted among the staff and 2) if, for example, the unrecognized faculty member has an MA and the previous faculty member does not, has_degree(RECOGNIZED_FACULTY, "MA") will yield a non-0 result.
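The stitching behavior, and the way a typo causes two people to merge, can be illustrated with a small sketch. The function stitch_entries and the sample lines are hypothetical; only the three keywords come from the text above, and the package's real whitelist and logic may differ.

```r
keywords <- c("Professor", "Fellow", "Lecturer")  # the keywords named above

stitch_entries <- function(lines, keywords) {
  pattern <- paste(keywords, collapse = "|")
  starts  <- grepl(pattern, lines)   # TRUE where a new entry begins
  group   <- cumsum(starts)          # each line joins the most recent entry start
  unname(vapply(split(lines, group), paste, character(1), collapse = " "))
}

lines_ok <- c(
  "Jane Q. Doe, Professor of Physics",
  "B.S. (1990) Example College",
  "John Roe, Lecturer in Music",
  "M.A. (1985) Sample University"
)
# Same data, but with a typo in the title so that no keyword matches.
lines_typo <- sub("Lecturer", "Lecterer", lines_ok, fixed = TRUE)

entries_ok   <- stitch_entries(lines_ok, keywords)    # two entries
entries_typo <- stitch_entries(lines_typo, keywords)  # one merged entry
```

In the merged entry, John Roe disappears from the count and his M.A. appears to belong to Jane Q. Doe, which mirrors both problems listed above.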

Professor Statistics

Now that data can be transferred from a PDF to the R environment, we can begin to better understand our professor population. Ages and genders will be determined based mostly on information available in the course catalog PDFs.

Age

For the sake of exploration, we will suppose that people are 22 when they receive their undergraduate degree. The working list of undergraduate degrees can be found by calling get_ba_degrees(). If a professor has earned a degree, such as Bachelor of Imagination (BI), which is not on this list, then we are unable to provide an approximate age for such a professor.

faculty2013[[4]]
calculate_age(faculty2013[4])

Professor Albrecht received her BS in 2001, and we note (2016-2001)+22 = 37, as expected (and desired). calculate_age returns a sentinel value of 0 if no BA can be found. Thus, the ages of professors with a recognized undergraduate degree can be isolated (and their number found with length(ages)) via

ages <- sapply(faculty2013, calculate_age)
ages <- ages[ages > 0] #0 is returned when no BA is found
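The age estimate itself is simple arithmetic: the years elapsed since the undergraduate degree, plus 22. The sketch below is a hypothetical stand-in for calculate_age, not the package's implementation. It assumes 2016 as the reference year (taken from the worked example above), a short made-up list of undergraduate degree abbreviations, and entries in which the degree is followed by a four-digit year.

```r
approx_age <- function(entry, reference_year = 2016,
                       ba_degrees = c("BA", "BS", "AB")) {
  for (degree in ba_degrees) {
    # The degree as a whole word, followed by the first four-digit year after it.
    pattern <- paste0("\\b", degree, "\\b[^0-9]*([0-9]{4})")
    hit <- regmatches(entry, regexec(pattern, entry))[[1]]
    if (length(hit) == 2) {
      return((reference_year - as.integer(hit[2])) + 22)
    }
  }
  0  # sentinel: no recognized undergraduate degree was found
}

approx_age("Jane Q. Doe, Assistant Professor, BS 2001 Example College, PhD 2008 Sample University")
approx_age("John Roe, Lecturer, BI 1999 Imagination School")
```

The first call reproduces the (2016-2001)+22 = 37 calculation, while the second returns the sentinel because BI is not on the assumed list.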

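The youngest professor can be found with

youngest <- min(ages)
# compare against the unfiltered ages so the indices line up with faculty2013;
# this returns all professors with a BA from that year
faculty2013[which(sapply(faculty2013, calculate_age) == youngest)]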

and similarly for the oldest professor:

oldest <- max(ages)
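# compare against the unfiltered ages so the indices line up with faculty2013
faculty2013[which(sapply(faculty2013, calculate_age) == oldest)]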
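When looking up the youngest and oldest professors, two details matter: the positions must refer to the original faculty list rather than to a filtered copy, and the comparison should be numeric rather than textual (grep treats its inputs as text, so a pattern of 29 would also match 129). A minimal sketch with made-up ages, where 0 is the "no BA found" sentinel:

```r
# Hypothetical ages for six faculty entries; 0 marks "no BA found".
ages_all <- c(37, 0, 52, 29, 0, 29)

# Mask the sentinel instead of dropping it, so positions stay aligned with the list.
ages_known <- replace(ages_all, ages_all == 0, NA)

youngest_idx <- which(ages_known == min(ages_known, na.rm = TRUE))
oldest_idx   <- which(ages_known == max(ages_known, na.rm = TRUE))
```

The resulting indices can be used directly to subset the faculty list the ages were computed from.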

The average age can be found as

mean(ages)

Let's take a look at the full spread of the ages of Williams Professors.

df <- data.frame(age = ages)
ggplot(df, aes(age)) + geom_bar(color = "lightblue", fill = "lightblue") + theme_classic() +
  geom_vline(xintercept = mean(ages), size = 1.5) +  # vertical line at the average age
  ggplot2::annotate("text", label = "average age", x = mean(ages) + 15, y = 16, size = 5, color = "red")

Ages over time

We now have access to enough tools to take a wide look at the data. For instance, we can start to look at the changes in age distribution over time:

years <- 2000:2013
# years <- c(2000, 2008, 2013)
# one faculty list per year, with the year appended to every entry
staff <- lapply(years, function(y) WilliamsStaff::collect_faculty(y, append_year = TRUE))
# note: entries without a recognized BA contribute the sentinel age 0 to these summaries
average_ages <- sapply(staff, function(row) mean(sapply(row, calculate_age)))
se           <- sapply(staff, function(row) std_error(sapply(row, calculate_age)))

df <- data.frame(age = average_ages, error = se, year = years)

ggplot(df, aes(x=year, y = age)) + geom_errorbar(aes(ymin=age-error, ymax=age+error), width = .1) + 
  geom_point(size=5) + geom_line() +
  scale_x_continuous(breaks=df$year)+ theme_classic() +
  ggtitle(paste0("Average ages of Staff\nfrom ", years[1], " to ", years[length(years)]))
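The error bars above are drawn from a standard error. For reference, the usual definition of the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below implements that definition under the name std_error_sketch; it is an assumption that the package's std_error follows the same formula.

```r
# Standard error of the mean: sample standard deviation over sqrt(n).
std_error_sketch <- function(x) sd(x) / sqrt(length(x))

example_ages <- c(30, 40, 50, 60)  # made-up ages
std_error_sketch(example_ages)
```

A larger faculty in a given year shrinks the bar for that year, even if the spread of ages is unchanged.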

A slight increase in the average age of professors over time can be seen. This may indicate that Williams is retaining its staff for longer periods of time or that hiring is biased toward older (and presumably more experienced) candidates.

Overlap between consecutive years

To better understand the source of the increasing age of professors over time, the WilliamsStaff package provides a resource for comparing the staff of one year to another.

The overlap function takes two faculty lists as input. It is important that the year is appended to each entry, as different years format the professor lists differently: earlier years tend to separate the name from the position with a series of spaces, while later years use punctuation. A professor is deemed to be in the overlap if their name appears on both lists. No correction is made for the inclusion (or omission) of a middle initial, but punctuation is removed before comparisons are made, so that "Daniel P. Aalberts" (2008) matches "Daniel P Aalberts" (2009).
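The name comparison can be sketched as follows. This is a simplified, hypothetical version that works on vectors of names that have already been extracted; the package's overlap function works on full faculty lists and may differ in detail. Apart from Daniel P. Aalberts, the names are made up.

```r
normalize_name <- function(x) {
  x <- gsub("[[:punct:]]", "", x)        # "Daniel P. Aalberts" -> "Daniel P Aalberts"
  trimws(gsub("[[:space:]]+", " ", x))   # collapse runs of whitespace
}

overlap_sketch <- function(names_a, names_b) {
  intersect(normalize_name(names_a), normalize_name(names_b))
}

names_2008 <- c("Daniel P. Aalberts", "Jane Q. Doe")
names_2009 <- c("Daniel P Aalberts", "John Roe")

shared <- overlap_sketch(names_2008, names_2009)
```

As in the package, a professor listed with a middle initial in one year and without it in the next would not be matched by this sketch.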

Furthermore, professors at Williams are known to take sabbaticals, so comparing only adjacent years may not tell the whole story. The overlap function can be extended to look at overlaps between non-consecutive years, but that is not explored in this paper.

We can take a look at the overlap in faculty between consecutive years as below.

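overlaps <- vector("list", length(staff) - 1)
overlaps_norm <- numeric(length(staff) - 1)
for(i in seq_len(length(staff) - 1)){
  overlaps[[i]] <- WilliamsStaff:::overlap(staff[[i]], staff[[i+1]])
  # fraction of the later year's staff that also appeared the year before
  overlaps_norm[i] <- length(overlaps[[i]])/length(staff[[i+1]])
}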

df <- data.frame(percent = overlaps_norm, year=years[-1])
ggplot(df, aes(x=year, y = percent)) + 
  geom_point(size=5) + geom_line() +
  scale_y_continuous(limits = c(0,1)) + 
  scale_x_continuous(breaks=df$year)+ theme_classic() +
  ggtitle(paste0("Overlap of Staff\nfrom ", years[1], " to ", years[length(years)]))

For the most part, it seems Williams has retained a fairly consistent 75 percent of its faculty. However, there is a relatively steep dropoff in 2013. Whether this is a statistical aberration or the result of institutional policy is worthy of further study.

Average ages of overlapping faculty

We can determine the ages of the overlapping faculty thanks to the search_name_in_list function. Given a name and a faculty list, it returns the index of that name in the list. From there, we can calculate the age as before.

mean_ages <- numeric(length(overlaps))
for(i in seq_along(overlaps)){
  later <- staff[[i+1]]  # overlaps were normed to the later year
  ages <- sapply(overlaps[[i]], function(person) {
    calculate_age(later[WilliamsStaff:::search_name_in_list(person, later)])
  })
  ages <- unlist(ages)
  ages <- ages[ages > 0]  # drop the sentinel returned when no BA is found
  mean_ages[i] <- mean(ages)
}
print(mean_ages)

With the average ages of the professors in the overlaps determined, we can now determine if the new staff are in general younger or older by subtracting the means. We would expect that the staff that overlaps in two consecutive years would be in general older than new staff hired, as new professors tend to be younger.

diff <- average_ages[-1] - mean_ages #remove the year 2000, as we normed overlaps to the later years
diff

As expected, for the most part, new staff are younger than the experienced staff. The difference in means for 2013 is more than 2 interquartile ranges away and can be considered an outlier. With this outlier removed, we can fit a linear regression to the remaining data points.

data=data.frame(x=2001:2012, y=diff[-length(diff)])

ggplot(data, aes(x=x, y=y))+ 
  geom_point(size=5) + theme_classic()+
  scale_x_continuous(breaks=data$x)+ geom_line()+geom_smooth(method="lm")+scale_y_continuous(limits=c(-5, 0))
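The outlier criterion used for 2013 can be written down explicitly. The sketch below flags values lying more than two interquartile ranges from the median. The text does not say what the distance is measured from, so the choice of the median is an assumption, and the numbers are invented for illustration rather than taken from the analysis.

```r
# Flag values more than k interquartile ranges away from the median (assumed reference point).
flag_outliers <- function(x, k = 2) {
  abs(x - median(x)) > k * IQR(x)
}

# Made-up differences in mean age, with one value far from the rest.
example_diff <- c(-3.1, -2.8, -2.9, -2.5, -2.6, -2.2, 1.4)

outlier_idx <- which(flag_outliers(example_diff))
```

Dropping the flagged positions before fitting keeps a single extreme year from dominating the slope of the regression line.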

In earlier years, there was a larger difference in ages between the staff that stayed to teach at Williams and the staff overall. 2013 is a statistical outlier, but it would be interesting to see whether the trend continues. It would be absurd to suppose that it continues indefinitely, as it is hard to imagine a world in which youthful professors with years of experience at Williams compete with older colleagues who have just signed on.

Conclusion

Having a better understanding of the ages of the professors at Williams College can help us track patterns and address issues before they spiral out of control. We have discovered a population of professors that has slightly grown in age with time and furthermore that the gap between the mean ages of experienced professors and of professors overall is shrinking.

To determine the severity of these trends, it would be worthwhile to determine if factors such as gender or department are more predictive of changes in ages.



PhilBrockman/WilliamsStaff documentation built on May 8, 2019, 1:33 a.m.