Function to Match Error-Prone Free Text to Standard School Names

Description

parseSchools finds the best-matching school name among several possible spellings and abbreviations. Matching proceeds in three stages: stage (1) standardizes the text by removing common typos and spelling errors, stage (2) looks up manually compiled name variations for the same school, and stage (3) uses an automated text-processing algorithm to find the closest school name in a standardized list.
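The three stages can be illustrated with a minimal sketch in base R. This is not the package's implementation: the helper names (clean_name, known_variants, match_school), the tiny list of standard names, and the use of utils::adist() as the closeness measure are all assumptions made for illustration.

```r
# Hypothetical three-stage matcher; none of these names come from the package.
standard_names <- c("university of california berkeley",
                    "stanford university",
                    "massachusetts institute of technology")

# Stage 1: standardize the text (lower-case, strip punctuation, squeeze spaces).
clean_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Stage 2: hand-written lookup of common variations (illustrative entries only).
known_variants <- c("uc berkeley" = "university of california berkeley",
                    "berkeley"    = "university of california berkeley",
                    "mit"         = "massachusetts institute of technology")

# Stage 3: fall back to the closest standard name by edit distance
# (an assumed measure), keeping the cleaned input if nothing is close enough.
match_school <- function(x, max_dist = 10) {
  x <- clean_name(x)
  hit <- x %in% names(known_variants)
  x[hit] <- unname(known_variants[x[hit]])
  d <- utils::adist(x, standard_names)
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(x), best)]
  ifelse(best_dist <= max_dist, standard_names[best], x)
}

match_school(c("university of california--berkly", "uc berkeley", "berkeley"))
```

In this sketch the misspelled first entry is caught only by stage 3, while the two abbreviations are resolved by the stage 2 lookup before any distance is computed.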

Usage

parseSchools( original_name, resolution = 10, map=FALSE )

Arguments

original_name

original_name denotes an Nx1 character vector of university names read from the Grad Cafe.

resolution

resolution controls how close a match must be before an original name is replaced with the best standardized equivalent. Very low values (0 to 5) are cautious selections that lead to fewer mismatches but sparser results. Medium-range values (8 to 12) lead to surprisingly accurate replacements when the other processing stages fail; one might expect a few mismatched replacements, but the number of errors should be fairly low. Large values (more than 20) practically guarantee that a school name which is not in the standard dictionary will be replaced with something. Be wary of such large selections: the potential for many mismatched replacements is high. For the test set, the bulk of the nearest matches were within 10 units of the original value, and almost none were farther than 30. The default value is 10.
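The effect of the threshold can be seen in a small sketch. It assumes that the "units" are edit distances as computed by utils::adist(); the source does not state which measure parseSchools uses, and replace_if_close and the two-entry standard list are hypothetical.

```r
# Assumed setup: a tiny standard list and three raw inputs.
standard_names <- c("university of california berkeley", "stanford university")
raw <- c("university of california berkly", "stanfrd univ", "harvard")

# Edit distance from every raw name to every standard name.
d <- utils::adist(raw, standard_names)
nearest <- apply(d, 1, min)        # distance to the closest standard name
best    <- apply(d, 1, which.min)  # index of that closest standard name

# Replace a name only when its nearest match is within `resolution` units.
replace_if_close <- function(resolution) {
  ifelse(nearest <= resolution, standard_names[best], raw)
}

replace_if_close(2)   # cautious: only the near-exact misspelling is replaced
replace_if_close(10)  # medium: the abbreviated name is replaced as well
replace_if_close(30)  # large: even "harvard" is forced onto a standard name
```

The last call shows the risk described above: a school absent from the standard list is still replaced once the threshold is large enough.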

map

map is a logical flag controlling whether the original school names are included in the data frame returned by brewdata(). If map=TRUE, the returned data includes both the parsed names and the original names. The default value is map=FALSE.

Value

school_name

school_name is the name of the university corresponding to the row of data. parseSchools normalizes the names reported on the website.

See Also

findScorePercentile, parseResults, parseSchools, translateScore, getGradCafeData, getMaxPages

Examples

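# Four spellings of the same school: a full name, a misspelling,
# an abbreviation, and a bare short form.
x <- c("university of california--berkeley", "university of california--berkly",
       "uc berkeley", "berkeley")
# Match each entry using the default resolution (10) and map = FALSE.
parseSchools(x)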