
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

cordis

The goal of cordis is to simplify access to data from CORDIS, the Community Research and Development Information Service. CORDIS is the European Commission's primary source of results from the projects funded by the EU's framework programmes for research and innovation, from FP1 to Horizon 2020.

Data preparation

CORDIS makes open data about European research projects available at various locations, such as:

Downloads are rate-limited when working directly against these files, and file formats and compression differ from one file to another. With some data preparation, these datasets can be loaded into a database, which gives simpler and faster local access to the data.

In data-raw there are data preparation scripts which download data from these locations into a local cache directory (~/.cache/cordis). A duckdb database is built from these files and uploaded using piggyback to a cordis-data GitHub repo as a versioned GitHub Release.
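The build step described above can be sketched roughly as follows. This is only an illustration and not the package's actual data-raw script: the file name, the table name demo_project and its columns are hypothetical, and a temporary directory stands in for the real cache directory and the downloaded files.

```r
library(DBI)
library(duckdb)

# Hypothetical stand-in for a file downloaded into the local cache directory
cache_dir <- file.path(tempdir(), "cordis-cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
csv_path <- file.path(cache_dir, "demo_project.csv")
write.csv(
  data.frame(id = c(1L, 2L), acronym = c("ALPHA", "BETA")),
  csv_path,
  row.names = FALSE
)

# Build a file-based duckdb database from the cached file
db_path <- file.path(cache_dir, "demo.duckdb")
con <- dbConnect(duckdb::duckdb(), dbdir = db_path)
dbWriteTable(con, "demo_project", read.csv(csv_path), overwrite = TRUE)

# Check what ended up in the database
n_rows <- dbGetQuery(con, "SELECT count(*) AS n FROM demo_project")$n
tables <- dbListTables(con)

dbDisconnect(con, shutdown = TRUE)
```

The resulting single database file is what would then be uploaded as a release asset.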

There is also a function for exporting the entire database in Parquet format. This makes it possible to move the data to, for example, a MinIO server, where it can be accessed by other data integration tools such as duckdb, Arrow or Apache Spark. Only package developers need to use these functions, when preparing and updating the data.
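A Parquet export of this kind can be sketched with duckdb's own EXPORT DATABASE statement. The example below is an assumption about how such an export could be done and does not show the package's export function; the table demo_project and the output directory are hypothetical.

```r
library(DBI)
library(duckdb)

# In-memory database with a small hypothetical table, for illustration only
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "demo_project",
             data.frame(id = 1:3, acronym = c("A", "B", "C")))

# Export all tables as Parquet files into one directory
export_dir <- file.path(tempdir(), "cordis-parquet")
dbExecute(con, sprintf("EXPORT DATABASE '%s' (FORMAT PARQUET)", export_dir))

parquet_files <- list.files(export_dir, pattern = "\\.parquet$")

# Any Parquet-aware tool can now read the files; here duckdb reads one back
n_back <- dbGetQuery(con, sprintf(
  "SELECT count(*) AS n FROM read_parquet('%s')",
  file.path(export_dir, parquet_files[1])
))$n

dbDisconnect(con, shutdown = TRUE)
```

The exported directory can then be copied to object storage with whatever tool is used for that purpose.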

Usage

Regular package users can install the data from the GitHub Releases location mentioned above by running cordis_import(), a function that imports the data from the data repository. This only needs to be done once. Downloading from GitHub Releases is not subject to the rate limits of the original CORDIS files.

Users can then make a connection to the database locally. This allows arbitrary in-process data processing with tidyverse tools such as dplyr.

Convenience functions allow inspecting the database schema and finding table and field names.
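To make the two points above concrete, here is a minimal sketch that uses only generic DBI and dplyr calls against an in-memory duckdb database. It does not use the cordis functions themselves, and the table demo_project and its columns are hypothetical.

```r
library(DBI)
library(duckdb)
suppressPackageStartupMessages(library(dplyr))

# In-memory database with a small hypothetical table
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "demo_project", data.frame(
  id = 1:4,
  programme = c("X", "X", "Y", "Y"),
  cost = c(10, 20, 30, 40)
))

# Table and field names via generic DBI calls
tables <- dbListTables(con)
fields <- dbListFields(con, "demo_project")

# dplyr verbs are translated to SQL and run inside duckdb;
# collect() brings the (small) result into R
by_programme <- tbl(con, "demo_project") |>
  group_by(programme) |>
  summarise(total_cost = sum(cost, na.rm = TRUE), n = n()) |>
  arrange(programme) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```

Because the aggregation runs inside the database, only the summarised rows are transferred into the R session.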

Installation

You can install the released version of cordis from GitHub with:

devtools::install_github("KTH-Library/cordis")

# run once to install local data
library(cordis)
cordis_import()

Example

This is a basic example which shows you how to work with the data.

Schema and available tables

Goal: show how to list the available tables and schema

library(cordis)
suppressPackageStartupMessages(library(dplyr))
library(knitr)

# tables in the database are prefixed with:
#   "ref"   (reference data from CORDIS)
#   "he"    (Horizon Europe)
#   "fp7"   (FP7)
#   "h2020" (Horizon 2020)

cordis_tables() |>
  arrange(desc(n_row)) |>
  print(n = 50)

# database schema
cordis_schema() |>
  head(20)

Data from CORDIS projects

Goal: To show how to work with data for Horizon Europe projects.

# get a connection
con <- cordis_con()

# remember to disconnect when done:
# cordis_disconnect(con)

# these tables are of primary interest
cordis_tables() |>
  filter(grepl("^he_", table))

# display first row of projects info
con |> tbl("he_project") |> head(1) |> glimpse()

# display first five rows of project data, excluding title and objective
con |> tbl("he_project") |> 
  select(-c("objective", "title")) |> 
  head(5) |> knitr::kable()

# display first row of deliverables data, dropping columns starting with "X"
con |> tbl("he_projectDeliverables") |> 
  head(1) |>
  select(-starts_with("X")) |>
  glimpse()

cordis_disconnect(con)

Tables for Horizon 2020, FP7 etc. are also available, as well as "reference data".

Data from CORDIS reference data

These datasets provide "reference data" for FP6, FP7 and Horizon 2020 projects; see this source.

Goal: Show how to work with reference data related to Horizon 2020 projects

# get a connection
con <- cordis_con()

# remember to disconnect when done:
# cordis_disconnect(con)

# reference data tables available in the database
cordis_tables() |> filter(grepl("^ref_", table))

# display first five rows with PI data
con |> tbl("h2020_pi") |> head(5) |> knitr::kable()

# display first row of projects info
con |> tbl("h2020_project") |> head(1) |> glimpse()

# display first row with publications data
con |> tbl("h2020_projectPublications") |> 
  head(1) |>
  glimpse()

cordis_disconnect(con)