
bqschol

The goal of bqschol is to provide an interface to SUB Göttingen's large scholarly datasets stored on Google BigQuery.

This package is for internal use.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("njahn82/bqschol")

Initialize the connection

Connect to the dataset with Crossref metadata snapshots:

library(bqschol)
my_con <- bqschol::bgschol_con(
  dataset = "cr_history")

Ideally, you have a service account token stored as a JSON file to make use of this package. If no token is available, your Google account credentials will be requested via the web browser.
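If the connection helper builds on the bigrquery package, a service account token can be supplied explicitly before connecting. This is a sketch under that assumption; the key file path below is illustrative:

```r
# Sketch: authenticate with a service-account key before connecting.
# Assumes bqschol uses bigrquery under the hood; the file path is illustrative.
library(bigrquery)
bigrquery::bq_auth(path = "my-service-account-key.json")

# Afterwards, the connection can be created as usual
my_con <- bqschol::bgschol_con(dataset = "cr_history")
```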

Table functions

The package provides wrappers for the most common table operations.

Let's start by listing the yearly Crossref historic snapshots.

bgschol_list(my_con)

We can determine the top publishers as of April 2018. Note that only Crossref records published after 2007 are stored.

library(dplyr)
cr_instant_df <- bgschol_tbl(my_con, table = "cr_apr18")
cr_instant_df %>%
    # top publishers
    dplyr::group_by(publisher) %>%
    dplyr::summarise(n = dplyr::n_distinct(doi)) %>%
    dplyr::arrange(desc(n)) 
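Because the table object is backed by the database, dplyr builds the query lazily and only a preview of the result is printed. To materialize the full result as a local tibble, `dplyr::collect()` can be appended, sketched here under the assumption that `bgschol_tbl()` returns a standard dbplyr table:

```r
# Sketch: pull the aggregated result into local memory with collect().
top_publishers <- cr_instant_df %>%
  # top publishers
  dplyr::group_by(publisher) %>%
  dplyr::summarise(n = dplyr::n_distinct(doi)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::collect()
```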

For more complex tasks, we use SQL.

cc_query <- c("SELECT
  publisher,
  COUNT(DISTINCT(DOI)) AS n
FROM
  cr_apr18,
  UNNEST(license) AS license
WHERE
  REGEXP_CONTAINS(license.URL, 'creativecommons')
GROUP BY
  publisher
ORDER BY
  n DESC
LIMIT
  10")
bgschol_query(my_con, cc_query)

Use bgschol_execute() when new tables are to be created or dropped in BigQuery.
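As a sketch, assuming `bgschol_execute()` takes the connection and a SQL statement just like `bgschol_query()`, a derived table could be created and dropped like this (the table name `cr_recent` is hypothetical):

```r
# Sketch: create a derived table with CTAS and drop it again.
# Assumes bgschol_execute(con, sql); 'cr_recent' is a hypothetical table name.
create_sql <- "CREATE TABLE cr_recent AS
  SELECT * FROM cr_apr18 WHERE issued > '2015'"
bgschol_execute(my_con, create_sql)

# Drop the table when it is no longer needed
bgschol_execute(my_con, "DROP TABLE cr_recent")
```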



subugoe/bqschol documentation built on Dec. 23, 2021, 6:40 a.m.