knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(magutils)
library(magrittr)

The package operates on a relational database of bibliometric data from Microsoft Academic Graph and dissertation data from ProQuest. To illustrate the usage, the package ships with a small replicate of the database with the necessary tables. Let's load the database:

db_file <- db_example("AcademicGraph.sqlite")
conn <- connect_to_db(db_file)

The tbls output above shows the all tables in the database.

Introducing the database

Overview

The database contains some of the tables originally in MAG and custom tables created by us. The tables taken from MAG have the same name as in MAG; the schema to their database is here.

Naming conventions

The names of the tables in the database and the columns in the tables follow two different conventions. Tables from MAG are named with CamelCase, as in the original database; custom tables created by us are named with snake_case (with a few exceptions not relevant to the front-end). Columns in MAG tables are named with CamelCase; columns in custom tables can be named either with CamelCase or snake_case.

Tables from ProQuest data are indicated with a "pq_" prefix; they and their columns are named with snake_case.

Database structure

The following describes each table; columns irrelevant to the package are omitted.

Tables current_links and current_links_advisors

Tables with links between records in MAG and ProQuest. They are created outside of the package.

current_links has links between the authors dissertations and authors in MAG. current_links_advisors has links between the advisors of dissertations and authors in MAG.

The identifier in ProQuest differs:

Table FieldsOfStudy

The fields of study as classified by MAG. The relevant columns are

Table FirstNamesGender

Maps first names to gender; data from genderize.io. Columns:

Table pq_authors

Table with authors of PhD dissertations from ProQuest. The relevant columns are:

Table pq_fields_mag

The fields of the dissertation and their corresponding field (level 0) in MAG.

Table pq_unis

Universities in ProQuest

Table pq_advisors

The dissertation advisors.

Note that advisors do not have their own unique identifier.

Getting information on the database

We can query information from the database schema with some custom functions.

We can use the following query to see the specific contents of each table:

dplyr::glimpse(sqlite_master_to_df(conn = conn))

Using the package

Loading data from proquest: get_proquest

The function loads the dissertations written in the United States (available in ProQuest) between start_year and end_year. The underlying table used for the output depends on from and is thus either pq_authors when querying "graduates" or pq_advisors when querying "advisors". The final output of the function also depends on from, and in general the returned columns always refer to the respective units.

Let's see an example:

graduates <- get_proquest(conn, from = "graduates", lazy = FALSE, limit = 3)
head(graduates)

The output consists of the following:

Alternatively, we can query advisors:

advisors <- get_proquest(conn, from = "advisors") 
head(advisors)

Important here is that the gender refers to the advisor. If we wanted to study student-advisor matches, we could join the output from "graduates" to the output from "advisors", using the column goid for the join. Note that a student can have multiple advisors, as indicated by the position column.

Loading the links: get_links

This returns the links between records in MAG and in ProQuest. With from = "graduates", we get the links from current_links. With from = "advisors", we get the links from current_links_advisors.

links <- get_links(conn, from = "graduates", lazy = TRUE)
head(links)
links <- get_links(conn, from = "advisors", lazy = TRUE)
head(links)

As stated in the message, we suspect the links between advisors need a higher min_score constraint to omit many false positives.

Adding information to records: augment_tbl

The function has two purposes:

  1. Join output or affiliation information to author units

  2. Join information on affiliations of co-authors of author units

Join output or affiliation information to author units

The tables used to augment tbl contain multiple rows per AuthorId, as the following example illustrates:

graduates_affiliation <- get_links(
    conn = conn,
    from = "graduates") %>%
  augment_tbl(conn = conn,
              with = "affiliation") 

head(graduates_affiliation)

For each AuthorId, we have one row for each year in which there is at least one paper with a reported affiliation of the author. The same when using with = "output". The function checks if there is already a Year column in the tbl, and if so, it automatically joins on AuthorId and Year and therefore avoids duplicates.

Join information on affiliations of co-authors of author units.

TBD.

Processing after querying

When should you load the data into memory with collect()? I suggest to do as many operations as possible on the database with dbplyr. That said, when data from ProQuest and MAG are combined, the resulting tables are usually not very large and should fit into memory.

DBI::dbDisconnect(conn)


f-hafner/magutils documentation built on Sept. 20, 2023, 5:05 a.m.