knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The goal of cordis
is to simplify data access to data from CORDIS, which is an acronym for the Community Research and Development Information Service. It is the European Commission's primary source of results from the projects funded by the EU's framework programmes for research and innovation. This includes programmes from FP1 to Horizon 2020.
CORDIS makes open data about European research projects available at various locations, such as:
The download speed is rate limited when working directly against these files, and file formats and compression are different. With some data preparation, these datasets can be loaded into a database, providing simplified and faster local access to the data.
In data-raw
there are data preparation scripts which download data from these locations into a local cache directory (~/.cache/cordis
). A duckdb database is built from these files and uploaded using piggyback to a cordis-data GitHub repo as a versioned GitHub Release.
There is also a function for exporting the entire database in Parquet format, which allows moving the data for example to a Minio server, where it can be accessed by other data integration tools like duckdb, Arrow, Apache Spark etc. Only package developers need to use these functions when preparing and updating the data.
The data in the "github releases" location mentioned above can be installed locally by regular package users by running "cordis_import()", a function for importing the data from the data repository. This needs to be done once. The download and upload rate is good.
Users can then make a connection to the database locally. This allows arbitrary in-process data processing with tidyverse tools such as dplyr.
Convenience functions allows for inspecting the database schema / finding table and field names.
You can install the released version of cordis
from GitHub with:
devtools::install_github("KTH-Library/cordis") # run once to install local data cordis_import()
This is a basic example which shows you how to work with the data.
Goal: show how to list the available tables and schema
library(cordis) suppressPackageStartupMessages(library(dplyr)) library(knitr) # tables in the database, prefixed with .... # "ref" (reference data from CORDIS) # "he" (Horizon Europe) # "fp7" (FP7) # "h2020" (Horizon 2020). cordis_tables() |> arrange(desc(n_row)) |> print(n = 50) # database schema cordis_schema() %>% head(20)
Goal: To show how to work with data for Horizon Europe projects.
# get a connection con <- cordis_con() # remember to disconnect when done: # cordis_disconnect(con) # these tables are of primary interest cordis_tables() |> filter(grepl("^he_", table)) # display first row of projects info con |> tbl("he_project") |> head(1) |> glimpse() # display first five rows with PI data, exclude title and objective con |> tbl("he_project") |> select(-c("objective", "title")) |> head(5) |> knitr::kable() # display first row with publications data con |> tbl("he_projectDeliverables") |> head(1) |> select(-starts_with("X")) |> glimpse() cordis_disconnect(con)
Tables for Horizon 2020, FP7 etc are also available, as well as "reference data".
These datasets provide "reference data" for FP6, FP7, Horizon 2020 projects, see this source
Goal: Show how to work with reference data related to Horizon 2020 projects
# get a connection con <- cordis_con() # remember to disconnect when done: # cordis_disconnect(con) # use any of these tables cordis_tables() |> filter(grepl("^ref_", table)) # display first five rows with PI data con |> tbl("h2020_pi") |> head(5) |> knitr::kable() # display first row of projects info con |> tbl("h2020_project") |> head(1) |> glimpse() # display first row with publications data con |> tbl("h2020_projectPublications") |> head(1) |> glimpse() cordis_disconnect(con)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.