knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(magutils)
library(magrittr)
The package operates on a relational database of bibliometric data from Microsoft Academic Graph (MAG) and dissertation data from ProQuest. To illustrate the usage, the package ships with a small replica of the database that contains the necessary tables. Let's load the database:
db_file <- db_example("AcademicGraph.sqlite")
conn <- connect_to_db(db_file)
The tbls output above lists all tables in the database.
The database contains some of the tables originally in MAG as well as custom tables created by us. The tables taken from MAG have the same names as in MAG; the schema of the MAG database is here.
Naming conventions
The names of the tables in the database and the columns in the tables follow two different conventions. Tables from MAG are named with CamelCase, as in the original database; custom tables created by us are named with snake_case (with a few exceptions not relevant to the front-end). Columns in MAG tables are named with CamelCase; columns in custom tables can be named either with CamelCase or snake_case.
Tables from ProQuest data are indicated with a "pq_" prefix; they and their columns are named with snake_case.
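The conventions above can be turned into a rough check of where a table comes from. The following is only a sketch in base R: classify_table is a hypothetical helper, not part of the package, and it relies on the name alone, so any of the exceptions mentioned above (for example a custom table with a CamelCase name) would be mislabelled.

```r
# Hypothetical helper (not part of magutils): guess a table's origin from its name.
classify_table <- function(name) {
  if (startsWith(name, "pq_")) {
    "ProQuest"                     # "pq_" prefix marks ProQuest tables
  } else if (grepl("^[A-Z][A-Za-z0-9]*$", name)) {
    "MAG"                          # CamelCase, as in the original MAG database
  } else if (grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", name)) {
    "custom"                       # snake_case without the "pq_" prefix
  } else {
    "unknown"
  }
}

tables <- c("FieldsOfStudy", "current_links", "current_links_advisors", "pq_authors")
origins <- vapply(tables, classify_table, character(1))
print(origins)
```

The prefix test comes first because ProQuest tables are also snake_case and would otherwise be labelled as custom.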
The following describes each table; columns irrelevant to the package are omitted.
current_links and current_links_advisors
Tables with links between records in MAG and ProQuest. They are created outside of the package.
current_links has links between the authors of dissertations and authors in MAG.
current_links_advisors has links between the advisors of dissertations and authors in MAG.
The identifier in ProQuest differs:
In current_links, it is goid: the dissertation identifier in ProQuest.
In current_links_advisors, it is relationship_id: the combination of dissertation identifier and position of the reported advisor.
FieldsOfStudy
The fields of study as classified by MAG. The relevant columns are
FirstNamesGender
Maps first names to gender; data from genderize.io. Columns:
FirstName
is female.
pq_authors
Table with authors of PhD dissertations from ProQuest. The relevant columns are:
pq_fields_mag
The fields of the dissertation and their corresponding field (level 0) in MAG.
pq_unis
Universities in ProQuest.
pq_advisors
The dissertation advisors.
Each row is identified by the combination of goid and the position. Note that advisors do not have their own unique identifier.
We can query information from the database schema with some custom functions.
We can use the following query to see the specific contents of each table:
dplyr::glimpse(sqlite_master_to_df(conn = conn))
get_proquest
The function loads the dissertations written in the United States (available in ProQuest) between start_year and end_year.
The underlying table used for the output depends on from: it is pq_authors when querying "graduates" and pq_advisors when querying "advisors".
The final output of the function also depends on from; the returned columns always refer to the respective unit (graduates or advisors).
Let's see an example:
graduates <- get_proquest(conn, from = "graduates", lazy = FALSE, limit = 3)
head(graduates)
The output consists of the following:
define_field: The MAG field 0, mapped from the reported first field of the dissertation.
define_gender: The imputed gender given the first name of the person.
Alternatively, we can query advisors:
advisors <- get_proquest(conn, from = "advisors")
head(advisors)
Important here is that the gender refers to the advisor. If we wanted to study student-advisor matches, we could join the output from "graduates" to the output from "advisors", using the column goid for the join. Note that a student can have multiple advisors, as indicated by the position column.
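A minimal sketch of such a student-advisor join, using base R and made-up data frames rather than the actual get_proquest output. Only goid and position are taken from the text; the gender column name and all values are assumptions for illustration.

```r
# Toy stand-ins for the "graduates" and "advisors" outputs; all values are made up.
graduates <- data.frame(
  goid = c(1, 2),
  gender = c("Female", "Male"),        # hypothetical column name
  stringsAsFactors = FALSE
)
advisors <- data.frame(
  goid = c(1, 1, 2),
  position = c(0, 1, 0),               # dissertation 1 has two advisors
  gender = c("Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Join on goid; the suffixes keep the student's and the advisor's gender apart.
matches <- merge(graduates, advisors, by = "goid",
                 suffixes = c("_student", "_advisor"))
print(matches)
```

Because a student can have several advisors, the result has one row per student-advisor pair, not one row per student.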
get_links
This returns the links between records in MAG and in ProQuest. With from = "graduates", we get the links from current_links. With from = "advisors", we get the links from current_links_advisors.
links <- get_links(conn, from = "graduates", lazy = TRUE)
head(links)
links <- get_links(conn, from = "advisors", lazy = TRUE)
head(links)
As stated in the message, we suspect that the links between advisors need a higher min_score constraint to filter out the many false positives.
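To illustrate what a stricter score threshold does, here is a sketch in base R on made-up links. It does not call get_links; the link_score column name, the relationship_id values and the threshold are all assumptions for illustration.

```r
# Made-up advisor links; link_score is a hypothetical column name.
links <- data.frame(
  AuthorId = c(101, 102, 103, 104),
  relationship_id = c("1_0", "1_1", "2_0", "3_0"),  # illustrative format only
  link_score = c(0.62, 0.91, 0.97, 0.75),
  stringsAsFactors = FALSE
)

# Keep only links at or above the threshold.
min_score <- 0.9
strict_links <- links[links$link_score >= min_score, ]
print(strict_links)
cat(nrow(strict_links), "of", nrow(links), "links kept\n")
```

Raising the threshold trades coverage for precision: fewer links survive, but the remaining ones are less likely to be false positives.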
augment_tbl
The function has two purposes:
Join output or affiliation information to author units
Join information on affiliations of co-authors of author units
The tables used to augment tbl contain multiple rows per AuthorId, as the following example illustrates:
graduates_affiliation <- get_links(conn = conn, from = "graduates") %>%
  augment_tbl(conn = conn, with = "affiliation")
head(graduates_affiliation)
For each AuthorId, we have one row for each year in which there is at least one paper with a reported affiliation of the author. The same holds when using with = "output".
The function checks if there is already a Year column in the tbl, and if so, it automatically joins on AuthorId and Year and therefore avoids duplicates.
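The effect of including Year in the join can be seen in a small base R sketch on made-up data. It does not reproduce the internals of augment_tbl: merge performs an inner join here as a stand-in, and the AffiliationId column and all values are assumptions.

```r
# A tbl that already has one row per author and year (made-up values).
tbl <- data.frame(AuthorId = c(1, 1, 2), Year = c(2001, 2002, 2001))

# Affiliation data with one row per author and year (made-up values).
affiliation <- data.frame(
  AuthorId = c(1, 1, 2, 2),
  Year = c(2001, 2002, 2001, 2003),
  AffiliationId = c(10, 11, 20, 21)
)

# Joining on AuthorId alone pairs every tbl row with every year of that author.
by_author <- merge(tbl, affiliation[, c("AuthorId", "AffiliationId")],
                   by = "AuthorId")

# Joining on AuthorId and Year keeps one row per author-year.
by_author_year <- merge(tbl, affiliation, by = c("AuthorId", "Year"))

cat("rows in tbl:", nrow(tbl), "\n")
cat("rows after join on AuthorId:", nrow(by_author), "\n")
cat("rows after join on AuthorId and Year:", nrow(by_author_year), "\n")
```

With the two-column key the row count of tbl is preserved in this example, whereas the single-column join multiplies rows.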
TBD.
When should you load the data into memory with collect()? I suggest doing as many operations as possible on the database with dbplyr. That said, when data from ProQuest and MAG are combined, the resulting tables are usually not very large and should fit into memory.
DBI::dbDisconnect(conn)