knitr::opts_chunk$set( message = FALSE, warning = FALSE, comment = FALSE, eval = FALSE, engine.opts = list(bash = "-l") )
The raw data can be found on these websites:
Everyone
➜ FP7 Projects
➜ FP7 Project Details
➜ Top Funded Projects
Everyone
➜ H2020 Projects
➜ H2020 Projects
➜ Top Funded Projects
Everyone
➜ H2020 Projects
➜ Data Export
➜ Raw Data Export Sheet
The Horizon Dashboard consists of a Qlik Sense application. Raw data are used in the official statistics of the European Commission but cannot be downloaded automatically. However, certain datasets can still be manually exported to xlsx
files (see Sources above).
The EU produces snapshots of its publicly available data on the EU Data Portal. Data can be found for FP6, FP7, and H2020. For FP7 and H2020, the most recent data are stored in ZIP archives containing CSV files about projects and participants.
It is possible to download the data by hand or automatically. The following outlines how to download the data automatically with the command-line and the Wget command.
First, we create a text file, inst/url_list.txt
, with the list of URLs from which we would like to download the data files.
cat(readLines(here::here("inst", "url_list.txt")), sep = '\n')
Second, we create a shell script, inst/download_files.sh
, which downloads the data files to the data-raw/
folder.
```{bash, eval=FALSE}
while read url; do wget -P ../data-raw $url done < ../inst/url_list.txt
Besides, it is also possible to inspect when the files were last updated. We do this by creating another shell script, `inst/last_updated.sh`, which displays the `Last-Modidied` information for each file. ```{bash} #!/bin/bash while read url; do wget --server-response --spider $url 2>&1 | grep -i Last-Modified done < ../inst/url_list.txt
To finish, we extract the files from the ZIP archives. The unzip command fails to extract the files from the Zip archives. The most recent release contains a faulty Zip file structure. Fortunately, the tar command offers a workaround.
```{bash, eval=FALSE}
tar -xvf data-raw/cordis-fp7projects-csv.zip mkdir data-raw/fp7 mv csv data-raw/fp7 tar -xvf data-raw/cordis-h2020projects-csv.zip mkdir data-raw/h2020 mv csv data-raw/h2020
<br><br> ## Data Preprocessing The first step is to load a few useful [libraries](https://github.com/zambujo/swissparticipation/blob/main/R/package.r) and helper [functions](https://github.com/zambujo/swissparticipation/blob/main/R/utils.R). ```r source(here::here("R", "package.r")) source(here::here("R", "utils.r"))
The data from the Horizon Dashboard only cover FP7 and H2020. The details about the projects are given in two tables, one per Framework Programme. Therefore, we select the matching columns to make it possible to merge the two tables.
matching_columns <- c( "project_id", "project_acronym", "thematic_priority", "topic_code", "topic_description", "participations" ) fp7_dashboard <- # import xlsx to data frame here("data-raw", "fp7-projects.xlsx") %>% read_excel(guess_max = 20000) %>% # convert names to snake case clean_names() %>% # rename columns for consistency between tables rename( project_id = project_nbr, thematic_priority = thematic_priority_descr) %>% select(all_of(matching_columns)) %>% # data frame label column mutate(framework_programme = "FP7") h2020_dashboard <- # import xlsx to data frame here("data-raw", "h2020-projects.xlsx") %>% read_excel(guess_max = 20000) %>% # convert names to snake case clean_names() %>% # rename columns for consistency between tables rename( project_id = project_nbr, thematic_priority = thema, participations = h2020_participations ) %>% select(all_of(matching_columns)) %>% # data frame label column mutate(framework_programme = "H2020") # merge the tables projects_dashboard <- bind_rows(fp7_dashboard, h2020_dashboard)
Details about the projects can also be found in the data downloaded from the Data Portal. Similarly to the Horizon Dashboard data, each Framework Programme has its own data table from which we select a few additional columns of interest.
additional_columns <- c( "project_id", "start_date", "end_date", "legal_basis", "funding_scheme" ) # use H2020 schema as reference projects_cordis_h2020 <- read_csv2(here("data-raw", "h2020", "csv", "project.csv")) %>% # for data type consistency mutate_if(is.POSIXct, as.Date) projects_cordis_fp7 <- readr::read_csv2(here("data-raw", "fp7", "csv", "project.csv")) %>% # for data type consistency mutate( contentUpdateDate = as.Date(contentUpdateDate), ecSignatureDate = as.Date(ecSignatureDate), masterCall = as.character(masterCall)) projects_cordis <- bind_rows(projects_cordis_fp7, projects_cordis_h2020) %>% # use snake case clean_names() %>% # rename columns for consistency between tables rename(project_id = id) %>% select(all_of(additional_columns))
Finally, we combine the details from the Horizon Dashboard and the Data Portal.
projects <- full_join(projects_dashboard, projects_cordis, by = "project_id")
:::info
For completeness, we also include the code which reads the data available for FP6. As indicated already, the Horizon Dashboard contains no data for FP6. Moreover, the Data Portal only provides contribution amounts for 2,621 cases (out of 75,241). The data from FP6 will therefore be ignored in the remainder of this work.
# FP6 project details projects_cordis_fp6 <- here("data-raw", "cordis-fp6projects.csv") %>% readr::read_csv2() %>% clean_names() # FP6 participation/contribution details organizations_cordis_fp6 <- here("data-raw", "cordis-fp6organizations.csv") %>% read_tsv() # beware: tsv, not csv
:::
Horizon Dashboard only includes participation data for H2020 and not for FP7. Nevertheless, the data in the Horizon Dashboard appear to be of a higher quality than that of the Data Portal and are therefore used as a source of official statistics.
First, we save the data to CSV.
organizations_dashboard_h2020 <- here("data-raw", "horizon2020.xlsx") %>% read_excel() %>% clean_names() %>% rename(project_id = project_nbr, pic = general_pic) # add call year organizations_dashboard_h2020 <- organizations_dashboard_h2020 %>% mutate( call_date = as.Date(call_deadline_date, "%d/%m/%Y"), call_year = format(call_date, "%Y") ) %>% select(-call_date) # store data types map_chr(organizations_dashboard_h2020, class) %>% write_yaml(here("data", "schema-dashboard_export.yml")) # split data frame by call year yrs <- organizations_dashboard_h2020 %>% distinct(call_year) %>% pull() %>% sort() file_names <- here("data", glue("dashboard-export-{yrs}.csv")) organizations_dashboard_h2020 %>% arrange(call_year) %>% group_by(call_year) %>% group_split() %>% setNames(yrs) %>% # write to ./data walk2(file_names, vroom_write, delim = ",")
And select a few columns of interest.
additional_columns <- c( "project_id", "pic", "partner_role", "pillar_abbr", "pillar_descr", "thematic_priority_abbr", "signature_date", "call_deadline_date", "eu_contribution") organizations_dashboard_h2020 <- organizations_dashboard_h2020 %>% select(all_of(additional_columns))
Details about participation can be found in the organization.csv tables.
non_matching_columns <- c( "contentUpdateDate", "SME", "nutsCode", "endOfParticipation", "active", "totalCost", "rcn" ) selection_columns <- c( "project_id", "pic", "legal_name", "legal_short_name", "legal_entity_type", "country_code", "legal_url", "partner_role", "ec_contribution", "net_ec_contribution" ) organizations_cordis_h2020 <- here("data-raw", "h2020", "csv", "organization.csv") %>% read_csv2(guess_max = 100000) %>% select(-all_of(non_matching_columns)) organizations_cordis_fp7 <- here("data-raw", "fp7", "csv", "organization.csv") %>% read_csv2(guess_max = 130000) %>% select(-all_of(non_matching_columns)) # 3 cases where (project_id, pic) # appear 2x w/ different partner_role organizations_cordis <- organizations_cordis_fp7 %>% bind_rows(organizations_cordis_h2020) %>% clean_names() %>% rename( pic = organisation_id, legal_name = name, legal_short_name = short_name, legal_entity_type = activity_type, country_code = country, legal_url = organization_url, partner_role = role ) %>% select(all_of(selection_columns))
Here again, we combine the details from the Horizon Dashboard and the Data Portal.
organizations <- organizations_dashboard_h2020 %>% mutate(partner_role = str_to_lower(partner_role)) %>% # for joining full_join(organizations_cordis, by = c("project_id", "pic", "partner_role") )
Having combined the datasets from the Horizon Dashboard and the Data Portal, we obtain three columns with details about funding contributions: ec_contribution
(Data Portal), net_ec_contribution
(Data Portal), and eu_contribution
(Horizon Dashboard). The latter comprises the values of the Commission's official statistics for H2020. eu_contribution
is nevertheless only available for H2020. To extrapolate to FP7 statistics, we are going to compare the columns ec_contribution
and net_ec_contribution
with eu_contribution
.
# take a sample of 10'000 observations df_sample <- organizations %>% filter(!is.na(eu_contribution)) %>% filter(!is.na(ec_contribution)) %>% filter(!is.na(net_ec_contribution)) %>% select(ends_with("contribution")) %>% mutate(ec_contribution = str_replace(ec_contribution, ",", ".")) %>% mutate_all(as.numeric) %>% sample_n(10000) # comparing `ec_contribution` with `eu_contribution` with( df_sample, plot( ec_contribution, eu_contribution, type = "p", pch = 19, cex = .5, log = "xy" ) ) # comparing `net_ec_contribution` with `eu_contribution` with( df_sample, plot( net_ec_contribution, eu_contribution, type = "p", pch = 19, cex = .5, log = "xy" ) )
In addition, comparing of the number of projects between the projects
table and the organizations
table shows that twenty four projects lack details about individual funding contributions.
anti_join(organizations, projects, by = "project_id") anti_join(projects, organizations, by = "project_id") # => 24 projects with no participation details master_view <- organizations %>% left_join(projects, by = "project_id")
Below are a few steps to clean up and improve the data.
country_code
and add the corresponding country names.# EU Data Portal country codes country_codes <- glue( "https://cordis.europa.eu/", "data/reference/", "cordisref-countries.csv") %>% read_csv2() %>% clean_names() %>% filter(language == "en") %>% select(eu_code, name) %>% rename( country_code = eu_code, # two-letter Eurostat country code country = name) master_view <- master_view %>% mutate( country_code = str_sub(country_code, 1, 2), # two-letter only country_code = str_replace(country_code, "KO", "XK")) %>% left_join(country_codes, by = "country_code")
legal_entity_type
.master_view <- master_view %>% # `legal_entity_type` is a three-letter label only mutate(legal_entity_type = str_sub(legal_entity_type, 1, 3))
master_view <- master_view %>% mutate( ec_contribution = str_replace(ec_contribution, ",", "."), ec_contribution = str_replace(ec_contribution, fixed("xxxxx"), NA_character_), eu_contribution = str_replace(eu_contribution, fixed("-"), "0"), ec_contribution = as.numeric(ec_contribution), net_ec_contribution = as.numeric(net_ec_contribution), eu_contribution = as.numeric(eu_contribution), # reconcile FP7 and H2020 contribution data re_contribution = eu_contribution, re_contribution = if_else(is.na(re_contribution), net_ec_contribution, re_contribution), re_contribution = if_else(is.na(re_contribution), ec_contribution, re_contribution) )
master_view <- master_view %>% mutate( pillar = str_extract(pillar_abbr, "\\d"), pillar = if_else(pillar_abbr == "EU.0.", "Cross-theme", pillar), pillar = if_else(pillar_abbr == "Euratom", pillar_abbr, pillar))
master_view <- master_view %>% mutate( signature_date = as.Date(signature_date, format = "%d/%m/%Y"), call_deadline_date = as.Date(call_deadline_date, format = "%d/%m/%Y"), signature_year = str_sub(signature_date, 1, 4), call_year = str_sub(call_deadline_date, 1, 4), start_year = str_sub(start_date, 1, 4), signature_year = int(signature_year), call_year = int(call_year), start_year = int(start_year) )
partner_role
to title case.master_view <- master_view %>% mutate(partner_role = str_to_title(partner_role))
Finally, we can save master_vier
as a CSV file, which can be viewed in here.
# store data types master_view %>% map_chr(class) %>% write_yaml(here("data", "schema-cordis-plus.yml")) # split CSVs by year df_years <- master_view %>% distinct(start_year) %>% pull() %>% sort() # write to ./data file_names <- here("data", glue("cordis-plus-{df_years}.csv")) master_view %>% filter(!is.na(start_date)) %>% arrange(start_year) %>% group_by(start_year) %>% group_split() %>% setNames(df_years) %>% walk2(file_names, vroom_write, delim = ",")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.