apsahtml2csv: Read, parse, and write to a .csv file APSA eJobs html files

apsahtml2csv    R Documentation

Read, parse, and write to a .csv file APSA eJobs html files

Description

Reads American Political Science Association (APSA) “eJobs” html files, parses the content of these files into a format for muRL to read, and writes that content to a .csv file.

Usage

apsahtml2csv(directory, file.name, file.ext = ".htm", verbose = TRUE)

Arguments

directory

a character string specifying the directory to which a set of APSA job announcement web pages has been downloaded.

file.name

a character string specifying the name of the file to which the data should be written.

file.ext

a character string specifying the extension of the files from which the data will be harvested.

verbose

a logical specifying whether the file name and current working directory should be printed.

Details

After logging in to eJobs, the job announcement site of the American Political Science Association (APSA), the user can search for and find the APSA web page announcing a single job listing. The user can download the html from several such pages (usually with a simple “Save As” command, depending on one's operating system). apsahtml2csv then parses the html code from these pages, and sorts and stores the relevant content. A .csv file is written containing this content.
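A call following this workflow might look like the sketch below. The directory name "apsa_pages" and the output name "apsa_jobs.csv" are hypothetical placeholders, not values from the package; the call is guarded so that it only runs when muRL is installed and the directory exists.

```r
# Hypothetical locations; replace with your own.
html_dir <- "apsa_pages"      # folder holding the saved eJobs pages
out_file <- "apsa_jobs.csv"   # name of the .csv file to write

ran <- FALSE
if (requireNamespace("muRL", quietly = TRUE) && dir.exists(html_dir)) {
  # Arguments follow the Usage section above.
  muRL::apsahtml2csv(directory = html_dir,
                     file.name = out_file,
                     file.ext  = ".htm",
                     verbose   = TRUE)
  ran <- TRUE
}
```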

If the user downloads the APSA webpages using a different (or no) file extension, that extension (or "") should be specified using the file.ext argument. Because apsahtml2csv uses the value of file.ext in a grep command, we strongly recommend that the directory specified by directory include only the downloaded webpages, and no other files or directories.
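To illustrate why stray files matter, the base R sketch below matches an extension against a made-up list of file names. It does not reproduce the internal code of apsahtml2csv; it only assumes that the extension is used as a grep pattern, as stated above. The file names are invented for the example.

```r
# Invented directory listing, for illustration only.
files <- c("job1.htm", "job2.htm", "notes.html",
           "my_htm_notes.txt", "summary.csv")
file.ext <- ".htm"

# Used as a plain regular expression, "." matches any character and the
# pattern may occur anywhere in the name, so unrelated files match too.
loose <- grep(file.ext, files, value = TRUE)

# Escaping the dot and anchoring at the end of the name is stricter.
strict <- grep(paste0("\\", file.ext, "$"), files, value = TRUE)

print(loose)
print(strict)
```

Keeping only the downloaded pages in the directory avoids having to reason about such accidental matches.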

Institutions are inconsistent in how they enter the names of their jobs' contact representatives. Thus, some tweaking of the output of apsahtml2csv may be required in order to create a .csv file that can be seamlessly read by read.murl. Specifically, the user may have to take the single column of the output of apsahtml2csv called contact, and create columns called title, fname, and lname. Additionally, the user may have to adjust the position and subfield columns, as institutions may report these somewhat differently.
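One way to split the contact column is sketched below in base R. The contact strings and the list of recognized titles are assumptions made for the example; real eJobs entries vary more than this, so the result should be checked by hand.

```r
# Made-up contact entries; real entries are less regular.
contact <- c("Dr. Jane Smith", "Professor John Doe", "Mary Jones")

# Hypothetical set of titles to recognize at the start of the string.
title_pattern <- "^(Dr\\.|Prof\\.|Professor) +"
has_title <- grepl(title_pattern, contact)

# The title is the first word when a known title is present, else NA.
title <- ifelse(has_title, sub("^([^ ]+) +.*$", "\\1", contact), NA_character_)

# Remove the title, then treat the last word as the last name.
name  <- sub(title_pattern, "", contact)
fname <- sub(" +[^ ]+$", "", name)
lname <- sub("^.* +", "", name)

out <- data.frame(title, fname, lname, stringsAsFactors = FALSE)
print(out)
```

Names with suffixes, middle names, or multi-word surnames would need additional rules.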

Value

An R data frame is created and a .csv file is written. These include columns containing the APSA job listing ID number; the date the job advertisement was posted; the type of institution; the title and subfield of the position; the start date, salary, and region; the name of the institution and department; the name, address, city, state, ZIP code, and phone number of the individual to contact; the department or institution's web address; and a full paragraph description of the position.

The full paragraph description is stored in a column named desc. Due to the current parsing strategy, this field may include some excess characters from the APSA html page.
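If the excess characters get in the way, the desc column can be tidied after the fact. The sketch below assumes the residue consists of leftover HTML tags, non-breaking-space entities, and repeated spaces; the actual residue is not specified here, and the sample strings are invented.

```r
# Invented examples of descriptions with leftover markup.
desc <- c("<p>The Department invites applications&nbsp;for a position.</p>  ",
          "Apply by   Nov. 1.<br>")

clean <- gsub("<[^>]+>", " ", desc)                # drop anything tag-like
clean <- gsub("&nbsp;", " ", clean, fixed = TRUE)  # replace the entity
clean <- trimws(gsub(" +", " ", clean))            # collapse and trim spaces

print(clean)
```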

Author(s)

Ryan T. Moore rtm@american.edu and Andrew Reeves reeves@wustl.edu

See Also

read.murl


muRL documentation built on Aug. 22, 2023, 9:11 a.m.