knitr::opts_chunk$set( comment = "#>", collapse = TRUE, warning = FALSE, message = FALSE, echo = FALSE ) library(openairegraph) library(tidyverse)
OpenAIRE has collected and interlinked scholarly data from various openly available sources for over ten years. In December 2019, this European open science network released the OpenAIRE Research Graph Dump [@manghi_paolo_2019_3516918], a big scholarly dataset that comprises metadata about more than 100 million research publications and 18 million research datasets, as well as the relationships between them. These metadata are furthermore connected to open access locations and disambiguated information about persons, organisations and funders.
Like most big scholarly data dumps, the OpenAIRE Research Graph offers many opportunities for data analysis, but working with it can be challenging for data science practitioners [@Xia_2017]. One reason is the size of the data release. Although already split into several files, most of these data files are too large to fit into the memory of a moderately equipped personal computer when directly imported into statistical computing environments. Another challenge is the format. Statistical methods require tidy data, i.e. datasets with a rectangular structure made up of unambiguous columns and rows [@Wickham_2014]. The dump, however, consists of compressed XML-files following the comprehensive OpenAIRE data model [@manghi_paolo_2019_2643199], from which only certain elements may be needed for a specific data analysis.
In this paper, we will introduce the R package openairegraph that helps to transform the large OpenAIRE Research Graph Dump into relevant small datasets for analytical purposes. It aims at data science practitioners and researchers alike who wish to conduct their own statistical analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps. Our implementation does not only consist of tools that allow to explore and parse specific information from the OpenAIRE Research Graph Dump. The R package structure also provides a standard way to ensure the computational reproducibility of these tools. To demonstrate the usefulness of our implementation, we will benchmark compliance with the open access mandate in the European HORIZON 2020 funding programme.
R packages are an open and standardized way to write, document, test, and share R code. They are not limited to statistical methods, but can be also used to access, or even disseminate various types of data [@wickham_2015]. An increasing proportion of research articles from various fields cite or report about R packages [@Tippmann_2014].
R packages meet generic principles for computational reproducibility: coherent file organisation, separation of data, method and results, and specification of the computational environment [@Marwick_2018]. For data science practitioners, R packages, thus, provide a reliable way to re-use code from others. In the field of scholarly data, the rOpenSci initiative provides interfaces to many sources. Before becoming part of the rOpenSci suite, packages are openly reviewed against comprehensive criteria [@ropensci_2020_3749013].
The R package openairegraph, which is available on GitHub as a development version^1, has two sets of functions. The first set provides helpers to split the dump release into separate, de-coded XML records that can be stored individually. The other set consists of parsers that convert data from these XML files to a table-like data representation following the tidyverse philosophy, a popular approach to practise data science with tidy data [@tidyverse]. Splitting, de-coding and parsing are essential steps before analysing the OpenAIRE Research Graph.
The function openairegraph::oarg_decode()
splits and de-codes each record. Storing the records individually will allow to process the files independent from each other, which is a common approach when working with big data.
dir.create("data") download.file( url = "https://zenodo.org/record/3516918/files/h2020_results.gz", destfile = "data/h2020_results.gz" ) oaire <- jsonlite::stream_in(con = file("data/h2020_results.gz")) %>% tibble::as_tibble() dir.create("data/records") oarg_decode( oaire = oaire, records_path = "data/records/", limit = 500, verbose = FALSE )
Because the dumps are quite large, the function furthermore has a parameter that allows setting a limit, which is helpful for inspecting the output first. By default, a progress bar presents the current state of the process.
So far, there are four parsers available:
openairegraph::oarg_publications_md()
retrieves basic publication metadata complemented by author details and access statusopenairegraph::oarg_linked_projects()
parses grants linked to publicationsopenairegraph::oarg_linked_ftxt()
gives full-text links including access informationopenairegraph::oarg_linked_affiliations()
parses affiliation dataThese parsers can be used alone, or together like this:
openaire_records <- list.files("data/records", full.names = TRUE) library(future) # necessary for now to create multisession if (!identical(Sys.getenv("GITHUB_ACTIONS"), "true")) { future::plan(multisession) tictoc::tic() oaire_data <- future.apply::future_lapply(openaire_records, function(files) { # load xml file doc <- xml2::read_xml(files) # parser out <- oarg_publications_md(doc) out$linked_projects <- list(oarg_linked_projects(doc)) out$linked_ftxt <- list(oarg_linked_ftxt(doc)) # use file path as id out$id <- files out }) tictoc::toc() } else { oaire_data <- lapply(openaire_records, function(files) { # load xml file doc <- xml2::read_xml(files) # parser out <- oarg_publications_md(doc) out$linked_projects <- list(oarg_linked_projects(doc)) out$linked_ftxt <- list(oarg_linked_ftxt(doc)) # use file path as id out$id <- files out }) } oaire_df <- dplyr::bind_rows(oaire_data)
First, we obtained the locations of the de-coded XML records. After that, we read each XML file using the xml2 [@xml2] package, and applied three parsers. We used the future [@future] and future.apply [@future_apply] packages to enable reading and parsing these records simultaneously with multiple R sessions. Running code in parallel reduces the execution time. Finally, we put together the individually created datasets into one dataset.
Documentation is an essential part of the R package workflow. For each parser, we described the usage, as well as the returned output. The latter is represented as data coding tables with an description of each variables, making it more convenient to grasp the otherwise complex structure of the OpenAIRE Research Graph. While executable examples are provided for each function, an introductory vignette, an executable long-form documentation written in R Markdown[@rmarkdown], demonstrates a workflow using the package. All documentation is automatically rendered to an R manual, and to a package website^2 created with pkgdown [@pkgdown].
While writing the package, we collected exemplary OpenAIRE Research Graph records and tested our parser against these files. The package provides testthat [@testthat] unit testing for these cases to make sure that the parsers work as expected. Along with the general automated checking routines for a R package build, these tests will be executed on a continuous integration services after every code updates. The package is available via GitHub, and can be installed from it. Following good practices for research software citation and preservation, releases are archived via Zenodo, which issues DOIs for persistent identification. Altogether, this release workflow enables that the package is "both archived for reproducibility and actively maintained for reusability" [@hasselbring2019fair].
A main purpose of the OpenAIRE Research Graph, which links grants to publications and open access full-texts, is monitoring the efficacy of the open access mandates. Here, we will demonstrate how to benchmark the compliance with the open access mandate of the European Commission’s HORIZON 2020 funding programme (H2020) against related funding activities using our implementation. We will focus on projects affiliated with the University of Göttingen. In doing so, we want to take into account that open access mandates generally lead to above-average open access uptake levels. Yet, they can vary by discipline and funding activity [@Larivi_re_2018].
# the parsed dump is too large to be tracked with GIT # Here's how to download it and put it in the # folder of this vignette library(tidyverse) download.file( url = "https://github.com/subugoe/scholcomm_analytics/releases/download/oaire_graph_post/h2020_parsed.json.gz", destfile = "data/h2020_parsed.json.gz" ) oaire_df <- jsonlite::stream_in( con = file("data/h2020_parsed.json.gz"), verbose = TRUE ) %>% tibble::as_tibble() pubs_projects <- oaire_df %>% select(id, type, best_access_right, linked_projects) %>% unnest_legacy(linked_projects) oa_monitor_ec <- pubs_projects %>% filter(funding_level_0 == "H2020") %>% mutate(funding_scheme = fct_infreq(funding_level_1)) %>% group_by( funding_scheme, project_code, project_acronym, best_access_right ) %>% summarise(oa_n = n_distinct(id)) %>% # per pub mutate(oa_prop = oa_n / sum(oa_n)) %>% filter(best_access_right == "Open Access") %>% ungroup() %>% mutate(all_pub = as.integer(oa_n / oa_prop))
As a start, we imported and transformed the whole h2020_results.gz dump file following the above-described methods. The resulting dataset comprises 84,781 literature publications from 9,008 H2020 projects. After that, we identified H2020 projects with participation from the University of Göttingen using data from CORDIS, the European Commission’s research information portal. We calculated the open access share per project affiliated with the university, and compared them by H2020 funding stream, e.g. European Research Council (ERC) grants or Marie Skłodowska-Curie actions (MCSA).
# load local copy downloaded from the EC open data portal download.file( url = "https://cordis.europa.eu/data/cordis-h2020organizations.csv", destfile = "data/cordis-h2020organizations.csv" ) cordis_org <- readr::read_delim( file = "data/cordis-h2020organizations.csv", delim = ";", locale = readr::locale(decimal_mark = ",") ) %>% # data cleaning mutate_if(is.double, as.character) # tag projects affiliated with the University of Göttingen ugoe_projects <- cordis_org %>% filter(shortName %in% c("UGOE", "UMG-GOE")) %>% select(project_id = projectID, role, project_acronym = projectAcronym) pubs_projects_ugoe <- pubs_projects %>% mutate(ugoe_project = funding_level_0 == "H2020" & project_code %in% ugoe_projects$project_id) ugoe_funding_programme <- pubs_projects_ugoe %>% filter(ugoe_project == TRUE) %>% group_by(funding_level_1, project_code) %>% # min 5 pubs summarise(n = n_distinct(id)) %>% filter(n >= 5) %>% distinct(funding_level_1, project_code) goe_oa <- oa_monitor_ec %>% # min 5 pubs filter(all_pub >=5) %>% filter(funding_scheme %in% ugoe_funding_programme$funding_level_1) %>% mutate(ugoe = project_code %in% ugoe_funding_programme$project_code) %>% mutate(`H2020 project` = paste0(project_acronym, " | OA share: ", round(oa_prop * 100, 0), "%")) # plot as interactive graph using plotly ggplot(goe_oa, aes(funding_scheme, oa_prop)) + geom_boxplot() + geom_jitter(data = filter(goe_oa, ugoe == TRUE), aes(label = `H2020 project`), colour = "#AF42AE", alpha = 0.9, size = 3, width = 0.25) + geom_hline(aes( yintercept = mean(oa_prop), color = paste0("Mean=", as.character(round( mean(oa_prop) * 100, 0 )), "%") ), linetype = "dashed", size = 1) + geom_hline(aes( yintercept = median(oa_prop), color = paste0("Median=", as.character(round( median(oa_prop) * 100, 0 )), "%") ), linetype = "dashed", size = 1) + scale_color_manual(NULL, values = c("orange", "darkred")) + scale_x_discrete(guide = guide_axis(n.dodge = 2)) + scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) + labs(x = NULL, y = "Open Access Percentage", caption = "Data: OpenAIRE Research Graph") + theme_minimal() + theme(legend.position = "top", legend.justification = "right")
Figure 1 shows that some H2020-projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average in the peer group. At the same time, others perform below expectation. Together, this provides valuable insights into open access compliance at the institutional level, especially for research administrators who would like to consider differences by research field and grant type while assessing publication practises.
In summary, we have presented the R package openairegraph, and described how it helps data science practitioners working primarily in R to access and analyse the OpenAIRE Research Graph Dump. R packages, which follow a standardized structure including documentation and testing, are a well-established way to share code. Wrapping helpers into a R package, allows data science practitioners to explore big scholarly data dumps while shielding them from its complexity. As our use case has demonstrated, this approach provides a reliable way to investigate large variations in scholarly communication in a reproducible manner.
Not all data science practitioners use R. We encourage others to develop helpers for analysing big scholarly data dumps in other programming languages as well. Their implementation can draw from the standardized way how R packaging documents, tests and shares code for statistical analyses. Implementing such tools for statistical computing environments can greatly improve re-use of big scholarly data dumps in quantitative science studies and data-driven decision-making.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.