In f-hafner/magutils: Utilities for loading data from MAG-Proquest database

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(magutils)
library(magrittr)

The package operates on a relational database of bibliometric data from Microsoft Academic Graph and dissertation data from ProQuest. To illustrate the usage, the package ships with a small replicate of the database with the necessary tables. Let's load the database:

db_file <- db_example("AcademicGraph.sqlite")
conn <- connect_to_db(db_file)

The tbls output above shows the all tables in the database.

Introducing the database

Overview

The database contains some of the tables originally in MAG and custom tables created by us. The tables taken from MAG have the same name as in MAG; the schema to their database is here.

Naming conventions

The names of the tables in the database and the columns in the tables follow two different conventions. Tables from MAG are named with CamelCase, as in the original database; custom tables created by us are named with snake_case (with a few exceptions not relevant to the front-end). Columns in MAG tables are named with CamelCase; columns in custom tables can be named either with CamelCase or snake_case.

Tables from ProQuest data are indicated with a "pq_" prefix; they and their columns are named with snake_case.

Database structure

The following describes each table; columns irrelevant to the package are omitted.

Tables `current_links` and `current_links_advisors`

Tables with links between records in MAG and ProQuest. They are created outside of the package.

current_links has links between the authors dissertations and authors in MAG. current_links_advisors has links between the advisors of dissertations and authors in MAG.

AuthorId: The author identifier in MAG.
link_score: The probability $\in [0, 1]$ that the linked entities are truly the same. Calculated from the linking algorithm not part of the package.
iteration_id: The iteration identifer of the linking algorithm; not relevant for this package.

The identifier in ProQuest differs:

for current_links, it is goid: The dissertation identifier in ProQuest.
for current_links_advisors, it is relationship_id: the combination of dissertation identifier and position of the reported advisor.

Table `FieldsOfStudy`

The fields of study as classified by MAG. The relevant columns are

FieldOfStudyId: field identifier.
NormalizedName: the name of the field.

Table `FirstNamesGender`

Maps first names to gender; data from genderize.io. Columns:

FirstName: string of first name
ProbabilityFemale: probability that a person with FirstName is female.

Table `pq_authors`

Table with authors of PhD dissertations from ProQuest. The relevant columns are:

goid
fullname, firstname
degree_year: year of PhD degree.
degree_level: level of the degree; usually PhD or similar.
university_id: identifier of the university.
thesistitle: title of the thesis, normalized to match the paper titles in MAG.

Table `pq_fields_mag`

The fields of the dissertation and their corresponding field (level 0) in MAG.

goid
position: position, starting at 0, as reported in the xml files of the dissertation. The ordering has, to our knowledge, no meaning; but some dissertations have only one field.
fieldname: the field name as reported in ProQuest
mag_field0: the corresponding field0 in MAG. We make this correspondence based on similarity of field names and on some tables of fields from the OECD.

Table `pq_unis`

Universities in ProQuest

university_id: university identifier
normalizedname: name of the university
location: location of the university. Not standardized, but location of U.S. universities have all a "United States" string in them.

Table `pq_advisors`

The dissertation advisors.

goid
position: advisor's position, starting at 0, as reported in the xml files of the dissertation. The ordering has, to our knowledge, no meaning; but some dissertations have only one field. A position $> 0$ indicates the dissertation reports more than one advisor.
lastname, firstname: reported names of the advisor
relationship_id: unique identifier of the student-advisor relationship. This is just the combined goid and the position.

Note that advisors do not have their own unique identifier.

Getting information on the database

We can query information from the database schema with some custom functions.

We can use the following query to see the specific contents of each table:

dplyr::glimpse(sqlite_master_to_df(conn = conn))

Using the package

Loading data from proquest: `get_proquest`

The function loads the dissertations written in the United States (available in ProQuest) between start_year and end_year. The underlying table used for the output depends on from and is thus either pq_authors when querying "graduates" or pq_advisors when querying "advisors". The final output of the function also depends on from, and in general the returned columns always refer to the respective units.

Let's see an example:

graduates <- get_proquest(conn, from = "graduates", lazy = FALSE, limit = 3)
head(graduates)

The output consists of the following:

The field of study, using define_field: The MAG field 0, mapped from the reported first field of the dissertation.
The gender, using define_gender: The imputed gender given the first name of the person.

Alternatively, we can query advisors:

advisors <- get_proquest(conn, from = "advisors") 
head(advisors)

Important here is that the gender refers to the advisor. If we wanted to study student-advisor matches, we could join the output from "graduates" to the output from "advisors", using the column goid for the join. Note that a student can have multiple advisors, as indicated by the position column.

Loading the links: `get_links`

This returns the links between records in MAG and in ProQuest. With from = "graduates", we get the links from current_links. With from = "advisors", we get the links from current_links_advisors.

links <- get_links(conn, from = "graduates", lazy = TRUE)
head(links)

links <- get_links(conn, from = "advisors", lazy = TRUE)
head(links)

As stated in the message, we suspect the links between advisors need a higher min_score constraint to omit many false positives.

Adding information to records: `augment_tbl`

The function has two purposes:

Join output or affiliation information to author units
Join information on affiliations of co-authors of author units

Join output or affiliation information to author units

The tables used to augment tbl contain multiple rows per AuthorId, as the following example illustrates:

graduates_affiliation <- get_links(
    conn = conn,
    from = "graduates") %>%
  augment_tbl(conn = conn,
              with = "affiliation") 

head(graduates_affiliation)

For each AuthorId, we have one row for each year in which there is at least one paper with a reported affiliation of the author. The same when using with = "output". The function checks if there is already a Year column in the tbl, and if so, it automatically joins on AuthorId and Year and therefore avoids duplicates.

Join information on affiliations of co-authors of author units.

TBD.

Processing after querying

When should you load the data into memory with collect()? I suggest to do as many operations as possible on the database with dbplyr. That said, when data from ProQuest and MAG are combined, the resulting tables are usually not very large and should fit into memory.

DBI::dbDisconnect(conn)

f-hafner/magutils documentation built on Sept. 20, 2023, 5:05 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

f-hafner/magutils
Utilities for loading data from MAG-Proquest database

In f-hafner/magutils: Utilities for loading data from MAG-Proquest database

Introducing the database

Overview

Database structure

Tables `current_links` and `current_links_advisors`

Table `FieldsOfStudy`

Table `FirstNamesGender`

Table `pq_authors`

Table `pq_fields_mag`

Table `pq_unis`

Table `pq_advisors`

Getting information on the database

Using the package

Loading data from proquest: `get_proquest`

Loading the links: `get_links`

Adding information to records: `augment_tbl`

Join output or affiliation information to author units

Join information on affiliations of co-authors of author units.

Processing after querying

R Package Documentation

Browse R Packages

We want your feedback!

f-hafner/magutils Utilities for loading data from MAG-Proquest database

In f-hafner/magutils: Utilities for loading data from MAG-Proquest database

Introducing the database

Overview

Database structure

Tables current_links and current_links_advisors

Table FieldsOfStudy

Table FirstNamesGender

Table pq_authors

Table pq_fields_mag

Table pq_unis

Table pq_advisors

Getting information on the database

Using the package

Loading data from proquest: get_proquest

Loading the links: get_links

Adding information to records: augment_tbl

Join output or affiliation information to author units

Join information on affiliations of co-authors of author units.

Processing after querying

R Package Documentation

Browse R Packages

We want your feedback!

f-hafner/magutils
Utilities for loading data from MAG-Proquest database

Tables `current_links` and `current_links_advisors`

Table `FieldsOfStudy`

Table `FirstNamesGender`

Table `pq_authors`

Table `pq_fields_mag`

Table `pq_unis`

Table `pq_advisors`

Loading data from proquest: `get_proquest`

Loading the links: `get_links`

Adding information to records: `augment_tbl`