knitr::opts_chunk$set(tidy = FALSE,
                      echo = FALSE, 
                      message = FALSE, 
                      warning = FALSE)

Introduction

library(williams)
library(dplyr)
data("graduates")
graduates_details <- create_graduates(complete = TRUE)

The purpose of this document is to describe the procedure for the creating the faculty and graduate datasets. The data is collected from archives of the College's course catalogs, from r min(graduates$year) to r max(graduates$year).

Graduates

We begin with graduates. Our package provides two datasets: graduates and graduates_details. The graduates dataset provides the following variables:

names(graduates) %>% as.data.frame()

The graduates_details dataset expands on the above, and provides the following extra columns:

 names(graduates_details)[! names(graduates_details) %in%  names(graduates)] %>% as.data.frame()

For details on these variables, try ?graduates and ?graduates_details.

The datasets are created by scraping data from the course catalogs which are saved as .txt files in the inst/extdata directory. These files follow the naming convention of graduates-<YEAR (YYYY)>-<YEAR + 1 (YYYY)>, where YEAR is the year for the relevant catalog. For example, the text file with information about graduates from 2000 is saved as graduates-2000-2001.txt. The reason for this awkward naming convention is that the graduates from (spring) 2000 are listed in the course catalog which comes out in fall of 2000, which is the catalog for the 2000-2001 academic year. For convenience, we only save the pages containing information about graduates from the course catalogs to these files.

Let us look inside these text files. The following is an section from graduates-2000-2001.txt. In what follows, we describe the procedure for adding more data to the package. The example will help illustrate our steps.

Bachelor of Arts, Summa Cum Laude
*DoHyun Tony Chung, with honors in Political Economy
+*Rebecca Tamar Cover, with highest honors in Astrophysics
*Amanda Bouvier Edmonds
*Douglas Bertrand Marshall, with highest honors in Philosophy
+*Michelle Pacholec, with honors in Chemistry
*Grace Martha Pritchard, with honors in English
*Michael Vernon Ramberg
+*Taylor Frances Schildgen, with highest honors in Geosciences
*Qiang Sun
*Laura Susan Trice
*Max McMeekin Weinstein, with highest honors in Philosophy
*Catherine Anne Williams, with highest honors in History
Bachelor of Arts, Magna Cum Laude
David Scott Adams
*Julianne Humphrey Anderson
*Michael Zubrow Barkin
Robert Charles Blackstone
*Marlin Chu
*Sarah Ann Cohen, with highest honors in English
Mark Douglas Conrad
*Ellen Griswold Cook
*Mary Bowman Cummins
*Yana Dadiomova

In order to use the package to create the graduates and graduates_details datasets for subsequent years, follow the following steps:

  1. Add information about graduates from the course catalogs to text files. Follow the naming convention of graduates-<YEAR (YYYY)>-<YEAR + 1 (YYYY)>, where YEAR is the year for the relevant catalog. For example, the text file with information about graduates from 2000 is saved as graduates-2000-2001.txt. These files need to be saved in the inst/extdata directory.

  2. Edit the text files to follow the munging syntax. Such editing may often require getting your hands dirty with the text files. We describe the syntax below.

Bachelor of Arts, Summa Cum Laude
<Information about graduates with Summa Cum Laude latin honors> 
Bachelor of Arts, Magna Cum Laude
<Information about graduates with Magna Cum Laude latin honors> 
Bachelor of Arts, Cum Laude
<Information about graduates with Cum Laude latin honors> 
Bachelor of Arts
<Information about graduates with no latin honors> 

First, notice the lines "Bachelor of Arts, Summa Cum Laude", "Bachelor of Arts, Magna Cum Laude", "Bachelor of Arts, Cum Laude", and, "Bachelor of Arts". These lines serve as demarcations between different latin honors. Therefore, information about all graduates between the lines "Bachelor of Arts, Summa Cum Laude" and "Bachelor of Arts, Magna Cum Laude", are that about students that graduated with Summa Cum Laude. Similarly, lines between "Bachelor of Arts, Magna Cum Laude" and "Bachelor of Arts, Cum Laude", describe students which graduated with Magna Cum Laude, and so on.

All other lines in these text files contains information about a single graduate. These lines are organized as:

<name>, <honors information>

Here, <name> is just the name of the graduate, for example, "DoHyun Tony Chung".

<honors information> provides information about department honors earned by the graduate (if any). They are of the form with <honors level> in <department>. For example, "with honors in Political Economy" or "with highest honors in Mathematics". If the graduate has received honors from more than one department, the information about each is delimitted by a comma (<information about department honor 1>, <information about department honor 2>). For example, "with honors in Political Economy, with highest honors in Mathematics".

Finally, add information about Phi Beta Kappa & Sigma Xi. Membership of Phi Beta Kappa is indicated by a "*" before the graduate's name, and that of Sigma Xi Kappa by a "+" before the graduate's name.

Then, for example, information about a graduate takes the form: "*DoHyun Tony Chung, with honors in Political Economy".

  1. Recreate the genderizeR dataset for new graduates, and save it in sysdata.rda. For more information on how to achieve this, refer to the genderizeR documentation.

  2. Create and save the datasets.

graduates <- create_graduates(complete = FALSE)
graduates_details <- create_graduates(complete = TRUE) save(graduates, file = .../data/graduates.RData) save(graduates_details, file = .../data/graduates_details.RData)



karantibrewal/williams documentation built on May 3, 2019, 9:40 p.m.