title: "Introduction to the convergEU package"
author: "
Federico M. Stefanini -
Update: Berta Mizsei"
date: "r Sys.Date()
- rel 1.1.0
Index:"
output:
rmarkdown::html_vignette:
fig_caption: yes
toc: true
number_sections: true
vignette: >
%\VignetteIndexEntry{User-Guide}
%\VignetteEncoding{UTF-8}
\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
require(ggplot2)
require(dplyr)
require(tidyverse)
require(convergEU)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable)
require(kableExtra)
require(caTools)
require(leaflet)
require(leaflet.extras)
require(htmlwidgets)
require(webshot)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
The convergEU R package is a set of S3 functions and data objects suited for the analysis of economic and social convergence of Member States (MS) within the European Union (EU). The analyses performed by the package are not suitable for making causal inferences but they do allow the user to gain a wealth of insight into how convergence in certain indicators has evolved throughout time.
This vignette is intended to be a gentle introduction to the analysis of convergence performed with the convergEU suite of functions. The package allows the user to access data from Eurofound (local) and Eurostat (download) with little effort. Furthermore, it is possible to create or import custom datasets.
Since a dataset created or processed in this package should take the form of a tibble, at least some familiarity with the dplyr package (online dplyr site) is convenient. To implement convergence analysis with the convergEU package, the data must be in a tidy format, i.e. a rectangular table with time periods as the first column and units (e.g. Member States) in subsequent columns. Imported or downloaded data may not be in this tidy format, so the user may have to reformat their data to fit the requirements of the package. There are some cases where further elements of the tidyverse (https://www.tidyverse.org/) are used. For a general introduction to the tidyverse, see "R for Data Science" (online R4DS) by Wickham and Grolemund.
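As a minimal sketch of the reshaping step described above, the example below turns a long table (one row per time and country) into the time by countries layout. The data and the column names `geo` and `values` are made up for illustration; a real download may use different names.

```r
library(tibble)
library(tidyr)

# Made-up long-format data: one row per (time, country) pair.
longTB <- tribble(
  ~time, ~geo, ~values,
  2001,  "IT", 58.1,
  2001,  "DE", 68.9,
  2002,  "IT", 59.0,
  2002,  "DE", 68.8
)

# One row per time period, one column per country.
wideTB <- pivot_wider(longTB, names_from = geo, values_from = values)
wideTB
```

The resulting tibble has `time` as its first column followed by one column per country, which is the shape assumed throughout this vignette.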
The most common data processed with the package are indicators downloaded from EU repositories, which are often tied to policy targets. It is important to note that the desired policy target for an indicator may be higher values (e.g. GDP/capita) or lower values (e.g. NEET rate). These are referred to as highBest and lowBest in the package: some functions require that the user provide an argument as a string chosen between the two possibilities "highBest" and "lowBest".
Keep in mind that functions in the convergEU package calculate summaries over sets of EU countries (e.g. the EU27 or the Euro area). Therefore, the dataset used must contain values for all countries within the selected grouping, even if the user is only interested in just two or three countries.
The convergEU package produces results as a list with metainformation. This list is made up of three components: \$res, \$msg, \$err. The first list component, \$res, is the actual result, if computed. The second component, \$msg is a message (possibly a warning) accompanying the computed result. The third component, \$err, is an error message or a list of errors if a result is not computed.
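The three-component layout can be handled with a small wrapper. The sketch below uses a hand-built list and a hypothetical helper `get_res()`; neither is part of convergEU, and it assumes that unused components are `NULL`.

```r
# Hypothetical result object mimicking the $res / $msg / $err layout
# described above (built by hand, not returned by a convergEU function).
out <- list(
  res = data.frame(time = 2001:2003, IT = c(58.1, 59.0, 60.2)),
  msg = NULL,
  err = NULL
)

# Hypothetical helper: stop on errors, report messages, return the result.
get_res <- function(x) {
  if (!is.null(x$err)) {
    stop(paste(unlist(x$err), collapse = "; "))
  }
  if (!is.null(x$msg)) {
    message("Note: ", paste(unlist(x$msg), collapse = "; "))
  }
  x$res
}

tb <- get_res(out)
tb
```

Checking `$err` before using `$res` avoids silently working with a missing result.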
The R packages used in this vignette are:
require(convergEU)
require(ggplot2)
require(dplyr)
require(tidyverse)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable)
require(caTools)
This vignette presents how to work with two types of data sources: data produced by Eurofound that are available without an active internet connection, and Eurostat data that can be downloaded with an active internet connection.
Some datasets are accessible from the convergEU package using the R function data(). The code below loads a dataset with employment rates for the EU Member States.
library(convergEU)
data("emp_20_64_MS", package = "convergEU")
head(emp_20_64_MS)
The Eurofound datasets EWCS (European Working Conditions Survey) and EQLS (European Quality of Life Survey) are locally available within convergEU, see:
data(package = "convergEU")
The object dbEUF2018meta contains a description of the data available from within the convergEU package, i.e. data that do not need to be downloaded from the Eurofound website.
print(dbEUF2018meta, n=200,width=200)
The raw local Eurofound database is accessed as follows:
require(convergEU)
data(dbEurofound)
head(dbEurofound)
...where variable names are:
names(dbEurofound)
and the time ranges are:
c(min(dbEurofound$time), max(dbEurofound$time))
Remember that the databases will likely have missing data.
NOTE: Eurofound data are statically stored within the convergEU package. To have the most recent version of Eurofound data, please update the package.
The Eurofound dataset used during the development of this package, emp_20_64_MS, is the tidy version of the dataset emp_20_64. It contains employment rates for people aged 20 to 64 and can be used for analysis without further processing.
help(emp_20_64_MS)
All other locally accessible Eurofound indicators can be extracted from the Eurofound database as the first step of an analysis: this is the data preparation step. The user needs only to choose a time interval, an indicator and a set of countries (MS, Member States). For example:
convergEU_glb()$EU12
among those available:
names(convergEU_glb())[c(3:8)]
Remember that "EA" , "EA19" and "Euro area" are synonyms.
convergEU_glb()$EA
As an example, here is how one would select the "lifesatisf" indicator taken from the column "Code_in_database" within the object dbEUF2018meta that contains the meta information:
head(dbEUF2018meta)
myTB <- extract_indicator_EUF(
  indicator_code = "lifesatisf",  # Code_in_database
  fromTime = 2003,
  toTime = 2016,
  gender = c("Total", "Females", "Males")[2],
  countries = convergEU_glb()$EU12$memberStates$codeMS
)
myTB
which results in three components: a dataset ready for further analyses ("\$res"); possible messages for the user ("\$msg") and, if an error occurs, a string (i.e. text) containing error messages ("\$err").
Eurostat data can be downloaded directly from Eurostat via an active internet connection.
The heterogeneity in the structure of the different indicators collected in the database requires some care. Sometimes a list of covariates for the indicator is available (gender, age, poverty status, etc.) and these values must be set to produce a tidy dataset (i.e. a table containing only time by countries).
First, raw data may be downloaded using the option rawDump=T:
ddTB1 <- download_indicator_EUS(
  indicator_code = convergEU_glb()$metaEUStat$selectorUser[1],
  fromTime = 2005,
  toTime = 2015,
  gender = c(NA, "T", "F", "M")[2],  # c("Total","Females","Males")
  countries = convergEU_glb()$EU28$memberStates$codeMS,
  rawDump = T
)
ddTB1

sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB1,file=file.path(sourceFile1,"ddTB1.RData"))
load(file.path(sourceFile1, "ddTB1.RData"))
ddTB1
This results in a dataset which is not in the tidy format.
Note that unit and isced11 are auxiliary variables specific to this indicator and that some filtering must be performed to obtain a tidy dataset of years by countries. The argument rawDump=F filters and reshapes the bulk data as shown below, where convergEU_glb()\$EU28\$memberStates\$codeMS is a vector of strings for the considered countries, convergEU_glb()\$metaEUStat\$selectorUser[1] contains the name of the indicator and ageInterv can be used to specify an age interval for a given indicator:
ddTB2 <- download_indicator_EUS(
  indicator_code = convergEU_glb()$metaEUStat$selectorUser[1],
  fromTime = 2005,
  toTime = 2015,
  gender = c(NA, "T", "F", "M")[1],  # c("Total","Females","Males")
  ageInterv = NA,
  countries = convergEU_glb()$EU28$memberStates$codeMS,
  rawDump = F,
  uniqueIdentif = 1
)
convergEU_glb()$metaEUStat$selectorUser[1]
ddTB2

convergEU_glb()$metaEUStat$selectorUser[1]
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB2,file=file.path(sourceFile1,"ddTB2.RData"))
load(file.path(sourceFile1, "ddTB2.RData"))
ddTB2
The result is a list with the following components:
It is therefore possible to call the same function several times and specify the argument uniqueIdentif as an integer among those in the first column left of \$msg\$Further_Conditioning\$available_seleTagLs to obtain the same indicator under different scales and contexts. For example, here is how one would apply the fifth conditioning option that selects for males in the age interval "Y15-64":
ddTB3 <- download_indicator_EUS(
  indicator_code = convergEU_glb()$metaEUStat$selectorUser[1],
  fromTime = 2005,
  toTime = 2015,
  gender = "M",
  ageInterv = "Y15-64",
  countries = convergEU_glb()$EU28$memberStates$codeMS,
  rawDump = F,
  uniqueIdentif = 5
)
ddTB3

sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB3,file=file.path(sourceFile1,"ddTB3.RData"))
load(file.path(sourceFile1, "ddTB3.RData"))
ddTB3
The analysis of convergence is performed on clean and imputed data that takes the shape of a tidy dataset with time by countries. The first step after downloading data is the description of the main features of such a dataset.
An illustrative example follows with the indicator "JQIintensity_i":
# print(dbEUF2018meta[11,],n=20,width=100)
t(dbEUF2018meta[11, ])
First the raw dataset is extracted from the local Eurofound database:
ddTB4 <- extract_indicator_EUF(
  indicator_code = "JQIintensity_i",  # Code_in_database
  fromTime = 1965,
  toTime = 2016,
  gender = c("Total", "Females", "Males")[1],
  countries = convergEU_glb()$EU28$memberStates$codeMS
)
print(ddTB4$res, n = 35, width = 250)

sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB4,file=file.path(sourceFile1,"ddTB4.RData"))
load(file.path(sourceFile1, "ddTB4.RData"))
print(ddTB4$res, n = 35, width = 250)
...where error messages are not shown (the \$err list component is empty).
The print output tells us that missing values are present. First, it is convenient to assign a meaningful name to the downloaded data:
JQIinte <- ddTB4$res
More features may be investigated as usual with common R functions, like the dimension of the dataset:
dim(JQIinte)
... that is, $5$ rows and $30$ columns. Here's how to check variable names:
names(JQIinte)
To compute convergence measures, a dataset must not contain qualitative variables, vectors of strings, or missing values. A time variable must also be present, and if its name is not "time", then the name must be passed as an argument in function calls so that the data are processed properly. The check_data() function checks for unsuitable features that must be resolved before starting the analysis. The returned object states whether the dataset is ready for calculations and, if it is not, the error component states why the check failed:
* "Error: one or more missing values in the dataframe."
* "Error: qualitative variables in the dataframe."
* "Error: string variables in the dataframe."
* "Error: timeName variable absent."
* "Error: time variable is not ordered."
For example, with the JQIinte data we have:
check_data(JQIinte,timeName="time")
Missing values are present, thus imputation is required. This can be done by using the impute_dataset function:
JQIinteImp <- impute_dataset(JQIinte,
  timeName = "time",
  countries = convergEU_glb()$EU28$memberStates$codeMS,
  tailMiss = c("cut", "constant")[2],
  headMiss = c("cut", "constant")[2])$res
print(JQIinteImp, n = 35, width = 250)
\$res was appended to the function call so that only the tibble of years by countries is kept. The imputation selected for the first (tail) and last (head) years is "constant", so the first non-missing value is propagated to the missing years. Alternatively, all years in which one or more values are missing may be cut by passing the arguments:
tailMiss = "cut",headMiss = "cut"
The following code can be used to determine which years a country (here HR) is missing values for:
select(filter(JQIinte, is.na(HR)),time,HR)
This code does the same thing by using pipes (and the magrittr package):
JQIinte %>% filter(is.na(HR)) %>% select(time,HR)
The check_data function is called again but on imputed data:
check_data(JQIinteImp)
...where the suspected string variable is sex:
JQIinteFin <- dplyr::select(JQIinteImp, -sex)
check_data(JQIinteFin)
...and thus JQIinteFin is the final object to start the data analysis.
The tidyverse functions mutate, select, filter are extremely useful for further selection and inspection, and can be used with or without the forward-pipe operator (%>%).
Note that before starting the analysis, the number of digits may be selected. For example, the above tibble can be rounded to integers by invoking the function round() where digits = 0:
JQIinteFin[, -1] <- round(select(JQIinteFin, -time), digits = 0)
JQIinteFin
The basic imputation method is deterministic: a single missing year is replaced by the average of the two observed values on either side of it. If several consecutive values are missing, the indicator is assumed to change linearly over time between the two observed time points flanking the run of missing values.
intervalTime <- c(1999, 2000, 2001, 2002, 2003)
intervalMeasure <- c(66.5, NA, NA, NA, 87.2)
currentData <- tibble(time = intervalTime, veval = intervalMeasure)
currentData
resImputed <- impute_dataset(currentData,
  countries = "veval",
  timeName = "time",
  tailMiss = c("cut", "constant")[2],
  headMiss = c("cut", "constant")[2])
resImputed$res
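The linear rule described above can be cross-checked with base R. The sketch below uses `stats::approx()` as a stand-in for the interpolation; it is not the package's own code, and the object names are chosen for illustration only.

```r
yrs <- c(1999, 2000, 2001, 2002, 2003)
obs <- c(66.5, NA, NA, NA, 87.2)

# Linear interpolation between the two observed end points.
seen <- !is.na(obs)
filled <- approx(x = yrs[seen], y = obs[seen], xout = yrs)$y
filled
# 66.500 71.675 76.850 82.025 87.200
```

Each imputed year increases by the same step, (87.2 - 66.5) / 4 = 5.175.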
In the figure below, grey points are imputed using observed values represented by solid blue points:
tmp <- as.data.frame(currentData[c(1, 5), ])
tmp2 <- as.data.frame(resImputed$res[2:4, ])
myg <- ggplot(as.data.frame(resImputed$res), mapping = aes(x = time, y = veval)) +
  geom_point() +
  geom_line(data = resImputed$res, col = "red") +
  geom_point(data = tmp, mapping = aes(x = time, y = veval), size = 4, colour = "blue") +
  geom_point(data = tmp2, aes(x = time, y = veval), size = 4, alpha = 1/3, col = "black") +
  xlab("Time") +
  ylab("Measure / Index") +
  ggtitle("Blue points are observed values,\n grey points are imputed values \n")
myg
It is important to note that $10\%$ of missing values for a country may already require substantive justification before interpreting results after imputation. It is up to the user whether they proceed with the analysis after checking the missingness of their data.
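One quick way to inspect missingness before imputing is to compute the percentage of missing values per country. The sketch below uses made-up data and hypothetical country codes `AA` and `BB`.

```r
# Made-up dataset with a time column and two hypothetical countries.
toyTB <- data.frame(
  time = 2001:2010,
  AA = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  BB = c(NA, NA, 3, 4, NA, 6, 7, 8, 9, 10)
)

# Percentage of missing values in each country column.
missShare <- 100 * colMeans(is.na(toyTB[, -1]))
missShare
```

Here `AA` has 10% and `BB` 30% of its values missing, so `BB` would deserve closer scrutiny.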
This section focuses on smoothing values observed in the considered time interval. A smoothing procedure removes sudden large changes, showing a less variable time series than the original. Given that short time series (panel data) are considered here, a three-point weighted average is proposed. The smoothing procedure substitutes an original raw value $y_{m,i,t}$ of country $m$ indicator $i$ at time $t$ with the weighted average $$\check{y}_{m,i,t} = y_{m,i,t-1} ~ (1-w)/2 + w ~ y_{m,i,t} + y_{m,i,t+1} ~ (1-w)/2$$ where $0< w \leq 1$. The special case $w=1$ corresponds to no smoothing. In case of missing values, an NA is returned. If the weight is outside the interval $(0,1]$, an NA is returned. The first and last values are smoothed using weights $w$ and $1-w$.
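The weighted average can be written in a few lines of base R. This is only a sketch: `smooth3()` is a hypothetical helper, not the function exported by convergEU. It assumes that the end points put weight $w$ on the value itself and $1-w$ on its single neighbour, and it returns all NA whenever the input contains a missing value or the weight is invalid.

```r
# Illustrative three-point weighted average (hypothetical helper).
smooth3 <- function(y, w) {
  n <- length(y)
  if (w <= 0 || w > 1 || anyNA(y)) return(rep(NA_real_, n))
  out <- y
  if (n < 2) return(out)
  if (n > 2) {
    idx <- 2:(n - 1)
    # Interior points: w on the centre, (1 - w)/2 on each neighbour.
    out[idx] <- y[idx - 1] * (1 - w) / 2 + w * y[idx] + y[idx + 1] * (1 - w) / 2
  }
  # End points (assumption): w on the value, 1 - w on its only neighbour.
  out[1] <- w * y[1] + (1 - w) * y[2]
  out[n] <- w * y[n] + (1 - w) * y[n - 1]
  out
}

y <- c(60, 62, 61, 65)
smooth3(y, w = 0.5)  # 61.00 61.25 62.25 63.00
smooth3(y, w = 1)    # same values as y
```

With $w = 0.5$ the second value becomes $60 \cdot 0.25 + 62 \cdot 0.5 + 61 \cdot 0.25 = 61.25$.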
After loading the data, imputation and smoothing should be performed. The example below with countries IT and DE illustrates how to do so. First, check if missing values are present:
workTB <- dplyr::select(emp_20_64_MS, time, IT, DE)
check_data(workTB)
If the dataframe passes the check, proceed to the smoothing step. The time variable is dropped from the data and passed separately through timeTB:
resSM <- smoo_dataset(select(workTB, -time), leadW = 0.149, timeTB = select(workTB, time))
resSM
For a comparison:
compaTB <- bind_cols(workTB, select(resSM, -time))
names(compaTB) <- c("time", "IT", "IT1", "DE", "DE1")
compaTB
A graphical output shows changes for "IT", with the original index in blue and the smoothed index in red:
qplot(time,IT,data=compaTB) + geom_line(colour="navyblue") + geom_line(aes(x=time,y=IT1),colour="red") + geom_point(aes(x=time,y=IT1),colour="red",shape=8)
Similarly, for DE:
qplot(time,DE,data=compaTB) + geom_line(colour="navyblue") + geom_line(aes(x=time,y=DE1),colour="red") + geom_point(aes(x=time,y=DE1),colour="red",shape=8)
A weight equal to 1 leaves the data unchanged:
resSM <- smoo_dataset(select(workTB, -time), leadW = 1, timeTB = select(workTB, time))
compaTB <- bind_cols(workTB, select(resSM, -time))
names(compaTB) <- c("time", "IT", "IT1", "DE", "DE1")
qplot(time, IT, data = compaTB) +
  geom_line(colour = "navyblue") +
  geom_line(aes(x = time, y = IT1), colour = "red") +
  geom_point(aes(x = time, y = IT1), colour = "red", shape = 8)
A time window larger than $3$ could be considered, but please consider that profound economic and social changes can occur in the EU over $3$ years.
Several alternative smoothing algorithms are available in R. Classical moving average (MA) smoothers are provided by the caTools package.
The emp_20_64_MS dataset is used below, with Italy as the country on which the operations are showcased.
data(emp_20_64_MS)
cuTB <- select(emp_20_64_MS, time)
cuTB <- mutate(cuTB, ITori = emp_20_64_MS$IT)
At the beginning and end of this series values are averages calculated on a smaller and smaller number of observations (tails):
cuTB <- mutate(cuTB,
  IT_k_3 = runmean(emp_20_64_MS$IT, k = 3,
    alg = c("C", "R", "fast", "exact")[4],
    endrule = c("mean", "NA", "trim", "keep", "constant", "func")[4],
    align = c("center", "left", "right")[1]))
cuTB <- mutate(cuTB,
  IT_k_5 = runmean(emp_20_64_MS$IT, k = 5,
    alg = c("C", "R", "fast", "exact")[4],
    endrule = c("mean", "NA", "trim", "keep", "constant", "func")[4],
    align = c("center", "left", "right")[1]))
cuTB <- mutate(cuTB,
  IT_k_7 = runmean(emp_20_64_MS$IT, k = 7,
    alg = c("C", "R", "fast", "exact")[4],
    endrule = c("mean", "NA", "trim", "keep", "constant", "func")[4],
    align = c("center", "left", "right")[1]))
The options alg, endrule and align of the runmean function are documented in the caTools package.
The figure below shows results for different degrees of smoothing: original (black), k=3 (red), k=5 (blue), k=7 (orange).
myG <- ggplot(cuTB, aes(x = time, y = ITori)) + geom_line() + geom_point() +
  geom_line(aes(x = time, y = IT_k_3), colour = "red") +
  geom_point(aes(x = time, y = IT_k_3), colour = "red") +
  geom_line(aes(x = time, y = IT_k_5), colour = "blue") +
  geom_point(aes(x = time, y = IT_k_5), colour = "blue") +
  geom_line(aes(x = time, y = IT_k_7), colour = "orange") +
  geom_point(aes(x = time, y = IT_k_7), colour = "orange")
myG
The time series is typically so short that at $k=7$ many values are smoothed over a different number of observations (fewer at the start and end).
The above calculations can be performed as shown below in the convergEU package:
cuTB <- emp_20_64_MS[, c("time", "IT", "DE")]
ma_dataset(cuTB, kappa = 3, timeName = "time")
This approach is a bit less customizable, but it leads to a standard tidy dataset.
Absolute change for country $m$, indicator $i$ at time $t$ is defined as: $$ \Delta y_{m,i,t} = y_{m,i,t} - y_{m,i,t-1} $$
This can be calculated with the convergEU package with the function abso_change. In the emp_20_64_MS dataset, which is tidy and has no missing values:
data(emp_20_64_MS)
mySTB <- abso_change(emp_20_64_MS,
  time_0 = 2005,
  time_t = 2010,
  all_within = TRUE,
  timeName = "time")
names(mySTB$res)
The equation above results in:
mySTB$res$abso_change
If desired, fewer digits may be displayed. For instance, by rounding we get:
round(dplyr::select(mySTB$res$abso_change,AT:UK), 5)
The sum of absolute values over the interval from $t_0$ to $t_1$ (the arguments time_0 and time_t)
$$
\sum_{t=t_0+1}^{t_1} | \Delta y_{m,i,t}|
$$
is:
round(mySTB$res$sum_abs_change,4)
This sum can be divided by the number of pairs of consecutive years, so that the result is an average per pair of years:
round(mySTB$res$average_abs_change,4)
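The three quantities above can be reproduced by hand for a single country. The following base R sketch uses made-up values and is not the package implementation.

```r
# Made-up values of one indicator for one country, 2005 to 2008.
y <- c(60.0, 61.5, 61.0, 63.0)
names(y) <- 2005:2008

absoChange <- diff(y)             # y_t - y_{t-1} for 2006, 2007, 2008
sumAbs <- sum(abs(absoChange))    # sum of absolute changes
avgAbs <- sumAbs / length(absoChange)  # average per pair of years

absoChange
sumAbs   # 4
avgAbs   # 4/3
```

Four years give three pairs of consecutive years, hence the division by 3.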
The unweighted average of country values is an important summary statistic. The possible sets of countries that can be specified are stored within the function generating global static objects and tables, called convergEU_glb(). Below we showcase how to use this function with the emp_20_64_MS dataset.
First note that the Euro area includes the following MS:
convergEU_glb()$Eurozone
convergEU_glb()$EU19
The labels for the 28 MS are:
convergEU_glb()$EU28
The list of known MS labels is:
names(convergEU_glb())[3:7]
For example, the unweighted average in the emp_20_64_MS dataset for the EU28 would be:
average_clust(emp_20_64_MS, timeName = "time", cluster = "EU28")$res[,c(1,30)]
while for EU12 it would be:
average_clust(emp_20_64_MS, timeName = "time", cluster = "EU12")$res[,c(1,30)]
An unknown label, like "EUspirit", would cause a computational error:
average_clust(emp_20_64_MS,timeName = "time",cluster = "EUspirit")
...as would an incorrect time name:
average_clust(emp_20_64_MS,timeName = "TTime",cluster = "EA")
The time series can be plotted as shown:
wwTB <- average_clust(emp_20_64_MS, timeName = "time", cluster = "EU28")$res[, c(1, 30)]
mini_EU <- min(wwTB$EU28)
maxi_EU <- max(wwTB$EU28)
qplot(time, EU28, data = wwTB, ylim = c(mini_EU, maxi_EU)) +
  geom_line(colour = "navy blue") +
  ylab("emp_20_64")
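The unweighted average is simply the row mean over the columns of the chosen countries. A base R sketch with made-up data and hypothetical country codes is given below; it does not call average_clust().

```r
# Made-up tidy dataset: time by countries.
toyTB <- data.frame(
  time = 2001:2003,
  AA = c(60, 61, 62),
  BB = c(70, 70, 71),
  CC = c(50, 52, 53)
)

# Hypothetical cluster made of two of the three countries.
myCluster <- c("AA", "BB")
toyTB$clusterAvg <- rowMeans(toyTB[, myCluster])
toyTB[, c("time", "clusterAvg")]
```

Every country in the cluster has the same weight, regardless of its population.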
Several measures of convergence have recently been proposed by Eurofound (Eurofound, 2018). In this section, each measure is introduced and its usage showcased.
Let's assume that we have a tidy dataset (tibble) in the form years by countries. The calculations for beta convergence are performed according to the following linear model: $$ \tau^{-1}(ln(y_{m,i,t+\tau})-ln(y_{m,i,t})) = \beta_0 + \beta_1 ln(y_{m,i,t}) +\epsilon_{m,i,t} $$ where $m$ represents an EU Member State (country), $i$ refers to an indicator of interest, $t$ is the reference time and $\tau \in \{1,2,\ldots\}$ the length of the time window (typically $1$ or more years).
The output of beta_conv() is a list in which the transformed data, the point estimate of $\beta_1$ and a standard two-tailed test (including the p-value and adjusted R squared) are reported. While it is not implemented in the package, a one-tailed test of $H_0: \beta_1 \geq 0$ against $H_1: \beta_1 < 0$ may also be used. In the implementation of the function beta_conv(), the same reference time is maintained across different years. The division of the left-hand side by the amount of time elapsed can be skipped by passing the argument useTau = FALSE.
Below is an example on how to invoke the function:
require(ggplot2)
require(dplyr)
require(tibble)
empBC <- beta_conv(emp_20_64_MS,
  time_0 = 2002,
  time_t = 2006,
  all_within = FALSE,
  timeName = "time")
empBC
Note that all_within = FALSE is the default.
A plot of transformed data and the regression line can be obtained by running:
qplot(empBC$res$workTB$indic, empBC$res$workTB$deltaIndic, xlab="log-Indicator", ylab="Delta-log-indicator") + geom_abline(intercept = as.numeric(empBC$res$summary[1,2]), slope = as.numeric(empBC$res$summary[2,2]), colour = "red") + geom_text(aes(label=empBC$res$workTB$countries), hjust=0, vjust=0,colour="blue")
Labels are replicated as many times as the number of included years if all_within=TRUE was specified. Furthermore, note that if the value of the indicator at the start or end time were 0, calculating beta convergence would be impossible (since the log of 0 is not defined). To bypass this and allow the calculation of beta-convergence, a very small constant (equal to a hundredth of the smallest value in the dataset) is added to the indicator where it equals 0. This allows the calculation of beta convergence and should not affect the outcome of the analysis.
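The linear model behind beta convergence can be fitted directly with `lm()`. The sketch below uses made-up values for five hypothetical countries and is not the code used inside beta_conv(); it only mirrors the equation given above.

```r
# Made-up indicator values at time t (y0) and t + tau (y1).
toy <- data.frame(
  country = c("AA", "BB", "CC", "DD", "EE"),
  y0 = c(55, 60, 65, 70, 75),
  y1 = c(62, 65, 69, 72, 76)
)
tau <- 4

# Left-hand side: average log-change per unit of time.
toy$growth <- (log(toy$y1) - log(toy$y0)) / tau

# Regress the log-change on the log of the starting level.
fit <- lm(growth ~ log(y0), data = toy)
beta1 <- unname(coef(fit)[2])
beta1
```

In this toy example countries starting from lower levels grow faster, so the estimated slope is negative, which is the pattern read as beta convergence.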
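The key concept in sigma-convergence is variability with respect to the mean. Let $Y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t$, and $\overline{Y}_{A,i,t}$ the average over aggregation $A$. If $A = EU28$, the average $\overline{Y}_{EU28,i,t}$ is taken over the 28 member states, and summaries of dispersion around it, such as the variance and the standard deviation, are built from the deviations $Y_{m,i,t} - \overline{Y}_{EU28,i,t}$.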
For each year, the summaries above are calculated to let the user see if convergence (i.e., a reduction in heterogeneity) took place. Below is how to test for sigma convergence across the entire time interval in the dataset:
mySTB <- sigma_conv(emp_20_64_MS, timeName="time") mySTB
It is also possible to specify a time interval:
sigma_conv(emp_20_64_MS, time_0 = 2002, time_t = 2004)
The departure from the mean, where $-1,0,1$ indicates values respectively below $-1$, within the interval $(-1,1)$, and above $+1$, can be characterized like so:
res <- departure_mean(oriTB = emp_20_64_MS, sigmaTB = mySTB$res)
names(res$res)
res$res$departures
The contribution of each MS to the variance at a given time $t$ is evaluated by the square of the difference $(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2$ between the indicator $i$ of country $m$ at time $t$ and the unweighted average of the member states. These contributions can be obtained by running:
res$res$squaredContrib
It is also possible to decompose the numerator of the variance, called deviance, at each time in order to calculate the percentage contributed by each MS $m$ to the total deviance for indicator $i$ at time $t$: $$ 100 \cdot \frac{(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2}{\sum_{m} (Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2} $$
res$res$devianceContrib
Notice that each row adds to $100$.
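The decomposition can be verified on a small example. The base R sketch below uses a made-up matrix with years in rows and three hypothetical countries in columns; it does not call departure_mean().

```r
# Made-up indicator values: rows are years, columns are countries.
Y <- rbind(
  "2001" = c(AA = 60, BB = 70, CC = 50),
  "2002" = c(AA = 61, BB = 69, CC = 53)
)

# Squared deviations from the unweighted yearly average.
sqContrib <- (Y - rowMeans(Y))^2

# Percentage of the yearly deviance contributed by each country.
devContrib <- 100 * sqContrib / rowSums(sqContrib)
devContrib
rowSums(devContrib)  # each row adds to 100
```

In both years country AA sits exactly on the average, so the whole deviance is split equally between BB and CC.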
It is possible to produce a graphical output about the main features of a country's time series, as shown below:
myGG <- graph_departure(res$res$departures, timeName = "time", displace = 0.25, displaceh = 0.45, dimeFontNum = 4, myfont_scale = 1.35, x_angle = 45, color_rect = c("-1"='red1', "0"='gray80',"1"='lightskyblue1'), axis_name_y = "Countries", axis_name_x = "Time", alpha_color = 0.9 ) myGG
Any selection of countries is feasible:
#myWW1<- warnings()
myGG <- graph_departure(res$res$departures[, 1:10],
  timeName = "time",
  displace = 0.25,
  displaceh = 0.45,
  dimeFontNum = 4,
  myfont_scale = 1.35,
  x_angle = 45,
  color_rect = c("-1" = 'red1', "0" = 'gray80', "1" = 'lightskyblue1'),
  axis_name_y = "Countries",
  axis_name_x = "Time",
  alpha_color = 0.29
)
myGG
We now introduce gamma convergence by an index based on ranks.
Let $y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t=0,1,\ldots, T$, and $\{ \tilde{y}_{m,i,t}: m \in A \}$ the ranks for indicator $i$ over member states in the reference set $A$, for example $A = EU28$, at a given time $t$. The sum of ranks within member state $m$ is: $$ \tilde{y}^{(s)}_{m,i} = \sum_{t=0}^T \tilde{y}_{m,i,t} $$ The variance of the sum of ranks over the given interval $$ Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] $$ may be compared to the variance of ranks in the reference time $t=0$: $$ Var\left[ \{\tilde{y}_{m,i,0}: m \in A \} \right] $$
The Kendall index KI, with respect to cluster $A$ of member states for the indicator $i$ over a given time interval is: $$ KI(A,i,T) = \frac{Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] }{ (T+1)^2 ~~Var\left[\{\tilde{y}_{m,i,0}: m \in A \}\right] } $$
The measure of gamma-convergence is obtained with the following function:
gamma_conv(emp_20_64_MS,last=2016,ref=2002,timeName="time")
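The index can be computed by hand on a small example. The sketch below follows the formula for $KI$ given above using base R ranks on made-up data; it is not the code used inside gamma_conv(), and the direction of the ranking is an arbitrary choice that does not affect the variances.

```r
# Made-up values: rows are times t = 0, 1, 2; columns are countries.
Y <- rbind(
  c(AA = 60, BB = 70, CC = 50),
  c(AA = 62, BB = 69, CC = 55),
  c(AA = 68, BB = 66, CC = 60)
)

# Ranks of countries within each time point (one row per time).
ranksTB <- t(apply(Y, 1, rank))

# Sum of ranks of each country over the whole interval.
sumRanks <- colSums(ranksTB)

# Kendall index: variance of summed ranks over (T+1)^2 times the
# variance of ranks at the reference time t = 0.
nT <- nrow(Y) - 1
KI <- var(sumRanks) / ((nT + 1)^2 * var(ranksTB[1, ]))
KI
```

If the ranking never changed, the index would equal 1. Here AA overtakes BB at the last time point, so the index drops to 7/9.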
Let $y_{m,i,t}$ be the value of indicator $i$ for MS $m$ at time $t$, and $y^{(M)}_{i,t}$ the maximum value over member states in the reference set $A$ (e.g., $A = EU28$): $$ y^{(M)}_{i,t} = max(\{ y_{m,i,t}: m \in A\}) $$
The distance of MS $m$ from the top performer at time $t$ is: $$ y^{(M)}_{i,t} - y_{m,i,t} $$ The overall distance at time $t$, called delta, is the sum of distances over the reference set $A$ for the considered indicator $i$: $$ \delta_{i,t} = \sum_{m \in A} (y^{(M)}_{i,t} - y_{m,i,t}) $$
Delta-convergence can be calculated as follows:
delta_conv(emp_20_64_MS,"time")
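The definition of delta translates into one line of base R. The sketch below uses made-up data with three hypothetical countries and does not call delta_conv().

```r
# Made-up values: rows are years, columns are countries.
Y <- rbind(
  "2001" = c(AA = 60, BB = 70, CC = 50),
  "2002" = c(AA = 62, BB = 69, CC = 55)
)

# Yearly maximum minus each country's value, summed over countries.
topVal <- apply(Y, 1, max)
delta <- rowSums(topVal - Y)
delta
```

Delta falls from 30 in 2001 to 21 in 2002, meaning that the countries moved closer to the top performer.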
The convergEU package allows the user to produce scoreboards and fiches as HTML or PDF files in an automated way.
Scoreboards showcase the raw values of an indicator (level, $y_{m,i,t}$) for MS $m$ at time $t$ for indicator $i$. The difference between years, i.e. the change: $$ y_{m,i,t} - y_{m,i,t-1} $$ can be calculated by following the example below.
We can produce the scoreboard for the dataset emp_20_64_MS with the following:
data(emp_20_64_MS)
resTB <- scoreb_yrs(emp_20_64_MS, timeName = "time")
resTB
The result is a list with three components:
* the summary statistics;
* the numerical labels indicating the interval of the partition a level belongs to;
* the interval of the partition a change belongs to.
Numerical labels are assigned as follows, where $y$ is the original level or change and $m$ and $s$ denote the mean and the standard deviation (see the Draft Joint Employment Report from the Commission and the Council, 2019):
* value $1$ if $y \leq m -1 \cdot s$;
* value $2$ if $m -1\cdot s < y \leq m - 0.5\cdot s$;
* value $3$ if $m - 0.5\cdot s< y \leq m +0.5\cdot s$;
* value $4$ if $m +0.5\cdot s< y \leq m + 1\cdot s$;
* value $5$ if $y > m +1\cdot s$.
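The five labels correspond to a partition with right-closed intervals, which `cut()` reproduces directly. The sketch below uses made-up levels for five hypothetical countries and assumes that $m$ is the mean and $s$ the sample standard deviation returned by `sd()`; the package may compute $s$ differently.

```r
# Made-up levels of an indicator in one year.
y <- c(AA = 50, BB = 58, CC = 60, DD = 62, EE = 70)

m <- mean(y)
s <- sd(y)  # assumption: sample standard deviation

# Break points of the partition; intervals are open on the left and
# closed on the right, matching the inequalities listed above.
cutPoints <- c(-Inf, m - s, m - 0.5 * s, m + 0.5 * s, m + s, Inf)
labelsNum <- as.integer(cut(y, breaks = cutPoints, labels = 1:5, right = TRUE))
names(labelsNum) <- names(y)
labelsNum
```

With these values the mean is 60 and the standard deviation is about 7.21, so AA is labelled 1, EE is labelled 5 and the three central countries are labelled 3.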
We note that there is the possibility of representing the above summaries as coloured plots (TO DO) into scoreboards.
For the comparison of a country with the EU average, the following steps are recommended, from raw data:
# require(ggplot2)
# data(emp_20_64_MS)
selectedCountry <- "IT"
timeName <- "time"
myx_angle <- 45
outSig <- sigma_conv(emp_20_64_MS, timeName = timeName,
  time_0 = 2002, time_t = 2016)
miniY <- min(emp_20_64_MS[, -which(names(emp_20_64_MS) == timeName)])
maxiY <- max(emp_20_64_MS[, -which(names(emp_20_64_MS) == timeName)])
estrattore <- emp_20_64_MS[, timeName] >= 2002 & emp_20_64_MS[, timeName] <= 2016
ttmp <- cbind(outSig$res,
  dplyr::select(emp_20_64_MS[estrattore, ], -contains(timeName)))
myG2 <- ggplot(ttmp) +
  ggtitle(paste("EU average (black, solid) and country", selectedCountry, " (red, dotted)")) +
  geom_line(aes(x = ttmp[, timeName], y = ttmp[, "mean"]), colour = "black") +
  geom_point(aes(x = ttmp[, timeName], y = ttmp[, "mean"]), colour = "black") +
  # geom_line()+geom_point()+
  ylim(c(miniY, maxiY)) + xlab("Year") + ylab("Indicator") +
  theme(legend.position = "none") +
  # add countries
  geom_line(aes(x = ttmp[, timeName], y = ttmp[, "IT"], colour = "red"), linetype = "dotted") +
  geom_point(aes(x = ttmp[, timeName], y = ttmp[, "IT"], colour = "red")) +
  ggplot2::scale_x_continuous(breaks = ttmp[, timeName], labels = ttmp[, timeName]) +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(
      # size = ggplot2::rel(myfont_scale),
      angle = myx_angle
      # vjust = 1,
      # hjust = 1
    ))
myG2
It is also possible to graphically show departures in terms of the above defined partition:
obe_lvl <- scoreb_yrs(emp_20_64_MS, timeName = timeName)$res$sco_level_num
# select subset of time
estrattore <- obe_lvl[, timeName] >= 2009 & obe_lvl[, timeName] <= 2016
scobelvl <- obe_lvl[estrattore, ]
my_MSstd <- ms_dynam(scobelvl,
  timeName = "time",
  displace = 0.25,
  displaceh = 0.45,
  dimeFontNum = 3,
  myfont_scale = 1.35,
  x_angle = 45,
  axis_name_y = "Countries",
  axis_name_x = "Time",
  alpha_color = 0.9
)
my_MSstd
The convergEU package provides a function that automatically prepares one or more country fiches. The function allows the user to pass arguments to obtain information on various indicators and countries. The user specifies one key country and some other countries of interest can be listed to compare performances. Note that most arguments should be passed as strings instead of object names and that the dataset must be a complete (without missing values) tibble. Internet connection should be available when invoking the function to properly render the results.
Below is an example of a call to the function go_ms_fi() to illustrate the syntax. This command would create a country fiche for Germany comparing it with Italy, the UK and France for the timeframe 2002-2016. Note that most arguments are passed as strings instead of object names.
go_ms_fi( workDF ='myTB', countryRef ='DE', otherCountries = "c('IT','UK','FR')", time_0 = 2002, time_t = 2016, tName = 'time', indiType = "highBest", aggregation = 'EU27', x_angle = 45, dataNow = Sys.time(), author = 'A.Student', outFile = 'Germany-up2-2016', outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish", indiName = 'emp_20_64_MS', workTB = NULL )
Here is a breakdown of the arguments passed in the function: workDF is a string specifying the name of the working dataset that must be available in the global environment. countryRef is a string determining the country (or unit) of main interest. This country will be shown in one-country plots. the short name of a member country that will be shown in one-country plots. otherCountries specifies other countries that should be included in the analysis for comparison. time_0 specifies the starting time. time_t specifies the end time. tName is the name of the variable containing times. indiType specifies whether the indicator is of type "lowBest" or "highBest" (i.e. if a low or high value is desirable for a country). aggregation specifies the reference group of EU countries (e.g., 'EU27', 'EA'). If using a dataset with units other than EU Member States, then this should be 'custom'. x_angle determines the axis orientation for time labels in graphs. dataNow specifies the date of creation of the country fiche. Sys.time() will provide the exact date and time of when the function was run. author specifies the author of the report, which will be shown in the fiche. outFile is a string determining the name of the output file. This should not include a path. outDir determines the output directory, eventually not existing (only one level allowed). indiName is a string determining how the name of the considered indicator will appear in the fiche. workTB* is the name of a tibble containing data, optional, as an alternative to a global object.
Of particular importance is the argument outFile, a string giving the name of the output file. Similarly, outDir is the path (drive and folders) in which the final compiled HTML will be stored. The syntax of the path depends on the operating system; for example, outDir='F:/analysis/IT2018' indicates that R will write the country fiche to the folder 'IT2018', located inside the folder 'analysis' on the USB drive 'F'. The drive 'F' and the folder 'analysis' must already exist, whereas the folder 'IT2018' is created by the function if it does not already exist.
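Since only the last folder of outDir may be missing, it can help to check the parent folder before compiling a fiche. The sketch below uses base R only; parent_exists() is a hypothetical helper, and a temporary directory stands in for a real location such as 'F:/analysis'.

```r
# Hypothetical helper (not part of convergEU): TRUE when everything except
# the last folder of `outDir` already exists.
parent_exists <- function(outDir) {
  dir.exists(dirname(outDir))
}

# A temporary folder stands in for an existing location like 'F:/analysis'.
base_dir <- file.path(tempdir(), "analysis")
dir.create(base_dir, showWarnings = FALSE)

ok_dir  <- file.path(base_dir, "IT2018")             # only the last level is missing
bad_dir <- file.path(base_dir, "missing", "IT2018")  # two levels are missing

ok_result  <- parent_exists(ok_dir)
bad_result <- parent_exists(bad_dir)
print(c(ok = ok_result, bad = bad_result))
```

A path like bad_dir would need its intermediate folder created first, for example with dir.create(dirname(bad_dir)).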
Besides the compiled HTML, the output directory also receives a file named after outFile with the suffix '-workspace.RData'. It contains the data and plots produced while compiling the country fiche, so that they can be reused in other technical reports.
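The saved workspace can be read back with base R's load(). The sketch below assumes the file name pattern described above (outFile followed by '-workspace.RData'). So that it runs on its own, it first writes a stand-in file to a temporary directory; the object demo_object and its values are placeholders, not what a real fiche run stores.

```r
# Path pattern described above: <outDir>/<outFile>-workspace.RData
out_dir  <- tempdir()                 # stands in for the real outDir
out_file <- "Germany-up2-2016"
ws_path  <- file.path(out_dir, paste0(out_file, "-workspace.RData"))

# Stand-in workspace with placeholder content; a real fiche run would have
# written this file itself, with different objects inside.
demo_object <- data.frame(time = 1:3, DE = c(1, 2, 3))
save(demo_object, file = ws_path)

# Load into a separate environment to avoid overwriting global objects.
fiche_env <- new.env()
loaded_names <- load(ws_path, envir = fiche_env)
print(loaded_names)
```

Loading into a dedicated environment is a design choice rather than a requirement: it keeps the stored objects apart from those already in the session, and ls(fiche_env) lists what the workspace contains.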
The function go_indica_fi() allows the user to create indicator fiches in the form of an HTML or PDF file. Note that most arguments should be passed as strings instead of object names, and that the dataset must be a complete tibble (i.e. without missing values). An internet connection should be available when invoking the function so that the results render properly.
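Because the dataset must not contain missing values, a quick completeness check before building the fiche can save a failed run. The following sketch uses base R on a made-up data frame in the expected layout (a time column followed by one column per unit); the same functions also work on a tibble.

```r
# Toy dataset with made-up values: time in the first column, units after it.
toy <- data.frame(
  time = 2005:2008,
  IT = c(1.0, 1.2, NA, 1.5),
  DE = c(2.0, 2.1, 2.2, 2.3)
)

has_missing        <- anyNA(toy)                       # any NA at all?
missing_per_column <- colSums(is.na(toy))              # NA count per column
incomplete_times   <- toy$time[!complete.cases(toy)]   # times with an NA

print(has_missing)
print(missing_per_column)
print(incomplete_times)
```

How to deal with the gaps found this way (dropping times, dropping units, or imputing values) depends on the analysis and is left to the user.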
An example of syntax to invoke the procedure is:
go_indica_fi(
  time_0 = 2005,
  time_t = 2010,
  timeName = 'time',
  workDF = 'emp_20_64_MS',
  indicaT = 'emp_20_64',
  indiType = c('highBest','lowBest')[1],
  seleMeasure = 'all',
  seleAggre = 'EU28',
  x_angle = 45,
  data_res_download = FALSE,
  auth = 'A.Student',
  dataNow = '2019/05/16',
  outFile = "test_IT-emp_20_64_MS",
  outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish",
  pdf_out = FALSE,
  workTB = NULL,
  selfContained = FALSE,
  eige_layout = FALSE,
  memStates = 'quintiles'  # one of 'quintiles', 'default', 'custom'
)
Here is a breakdown of the arguments passed to the function:

- time_0 specifies the starting time.
- time_t specifies the end time.
- timeName is the name of the variable containing times.
- workDF is a string specifying the name of the working dataset, which must be available in the global environment.
- indicaT is a string determining how the name of the indicator will appear in the fiche.
- indiType specifies whether the indicator is of type "lowBest" or "highBest" (i.e. whether a low or a high value is desirable for a country).
- seleMeasure determines which measures of convergence will be calculated. It is a subset of the following collection of strings: "all", "beta", "delta", "gamma", "sigma". If uncertain, we recommend using "all", which is a shortcut for the whole set.
- seleAggre specifies the set of EU countries (e.g., 'EU27', 'EU19') for which the analysis should be run. If the dataset contains units other than EU Member States, this should be 'custom'.
- x_angle determines the orientation of the time labels on the axis in graphs. The default is 45.
- data_res_download determines whether the data and results should be downloaded. The default is FALSE.
- auth specifies the author of the report, which will be shown in the fiche. The default is 'A.Student'.
- dataNow specifies the date of creation of the indicator fiche. The default is the current date and time.
- outFile is a string determining the name of the output file. It should not include a path.
- outDir determines the output directory, which need not exist yet. Only the last level of the path may be missing: the function cannot create both a folder and a sub-folder, so the parent folder must already exist and the final sub-folder is created if needed.
- pdf_out controls whether a PDF version of the fiche is created in addition to the HTML file. The default is FALSE.
- workTB is the name of a tibble containing the data; it is optional and serves as an alternative to a global object.
- selfContained should be set to TRUE if just one output file is desired.
- eige_layout should be set to TRUE if the EIGE (European Institute for Gender Equality) layout is desired.
- memStates determines what kinds of visualisations and analyses are included in the fiche when comparing units. There are three options: "default", "quintiles", and "custom".

"Default" includes an exploration of how countries' standard deviations changed throughout the timeframe. These are the original charts contained in version 0.5.1.

"Quintiles", which is the option selected in the example above, sorts the Member States into quintiles based on the values of the indicator for each measured time in the timeframe. The evolution of the position of Member States can be tracked, and three maps are generated: the first depicting the quintile groupings at the start time, the second depicting the quintile groupings at the end time, and the third depicting the change between the start and the end time for each Member State's quintile group (e.g., if Hungary was in the first quintile in 2007 and the third in 2020, then that would be a change of +2 quintiles).

"Custom" should be chosen if a custom dataset was used whose units (for example, regions) are not compatible with the first two kinds of analyses. If "custom" is chosen, a .csv file with quintile groupings will be created in the output folder along with the fiche.
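To make the quintile change described above concrete, the sketch below assigns units to quintile groups at two times and takes the difference. It uses base R's quantile() and cut() with made-up values for ten fictitious units; quintile_group() is a hypothetical helper, and the way convergEU computes its groupings internally may differ.

```r
# Hypothetical helper (not convergEU's internal code): assign each value to
# a quintile group, 1 = lowest fifth, ..., 5 = highest fifth.
quintile_group <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Made-up indicator values for ten fictitious units at the start and end time.
start_vals <- c(A = 1, B = 2, C = 3, D = 4, E = 5,
                F = 6, G = 7, H = 8, I = 9, J = 10)
end_vals   <- c(A = 5, B = 2, C = 3, D = 4, E = 1,
                F = 6, G = 7, H = 8, I = 9, J = 10)

q_start <- quintile_group(start_vals)
q_end   <- quintile_group(end_vals)

# Positive values mean a move to a higher quintile group.
change <- setNames(q_end - q_start, names(start_vals))
print(change)
```

In this toy data unit A moves from the first to the third quintile group, a change of +2, mirroring the Hungary example, while unit E moves the other way.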
Below the main references are listed: