title: "Introduction to the convergEU package"
author: "Federico M. Stefanini - Update: Berta Mizsei"
date: "r Sys.Date() - rel 1.1.0 Index:"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{User-Guide}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console


require(ggplot2)
require(dplyr)
require(tidyverse)
require(convergEU)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable) 
require(kableExtra)
require(caTools)
require(leaflet)
require(leaflet.extras)
require(htmlwidgets)
require(webshot)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)




Introduction

The convergEU R package is a set of S3 functions and data objects suited for the analysis of economic and social convergence of Member States (MS) within the European Union (EU). The analyses performed by the package are not suitable for causal inference, but they do give the user insight into how convergence in a given indicator has evolved over time.

This vignette is intended to be a gentle introduction to the analysis of convergence performed with the convergEU suite of functions. The package allows the user to access data from Eurofound (local) and Eurostat (download) with little effort. Furthermore, it is possible to create or import custom datasets.

Since a dataset created or processed in this package should take the form of a tibble, at least some familiarity with the dplyr package (see the dplyr website) is convenient. To implement convergence analysis with the convergEU package, the data must be in a tidy format, i.e. a rectangular table with time periods as the first column and units (e.g. Member States) in subsequent columns. Imported or downloaded data may not be in this tidy format, so the user will have to reformat their data to fit the requirements of the package. There are some cases where further elements of the tidyverse (https://www.tidyverse.org/) are used. For a general introduction to the tidyverse, see "R for Data Science" (R4DS, available online) by Wickham and Grolemund.
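
As a minimal sketch of that reformatting step, assume a long table with one row per year and country. The column names time, geo and values are hypothetical and not prescribed by convergEU. The function tidyr::pivot_wider() can then spread the countries into columns:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per year and country
longTB <- tribble(
  ~time, ~geo, ~values,
  2001,  "IT", 57.1,
  2001,  "DE", 68.9,
  2002,  "IT", 57.9,
  2002,  "DE", 68.5
)

# One row per year, one column per country, time as the first column
wideTB <- longTB %>%
  pivot_wider(names_from = geo, values_from = values) %>%
  arrange(time)

wideTB
```

The resulting tibble has the years-by-countries layout described above.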

The most common data processed with the package are indicators downloaded from EU repositories, which are often tied to policy targets. It is important to note that the desired policy target for an indicator may be higher values (e.g. GDP/capita) or lower values (e.g. NEET rate). These are referred to as highBest and lowBest in the package: some functions require that the user provide an argument as a string chosen between the two possibilities "highBest" and "lowBest".

Keep in mind that functions in the convergEU package calculate summaries over sets of EU countries (e.g. the EU27 or the Euro area). Therefore, the dataset used must contain values for all countries within the selected grouping, even if the user is interested in only two or three countries.

The convergEU package produces results as a list with metainformation. This list is made up of three components: \$res, \$msg, \$err. The first list component, \$res, is the actual result, if computed. The second component, \$msg is a message (possibly a warning) accompanying the computed result. The third component, \$err, is an error message or a list of errors if a result is not computed.
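
A minimal sketch of how such a result list might be handled is given below. The helper get_res() is hypothetical and not part of convergEU, the two lists are mock objects rather than real package output, and the sketch assumes that unused components are NULL:

```r
# Hypothetical helper (not part of convergEU): return $res, or stop on $err
get_res <- function(out) {
  if (!is.null(out$err)) {
    stop("computation failed: ", paste(unlist(out$err), collapse = "; "))
  }
  if (!is.null(out$msg)) {
    message("note: ", paste(unlist(out$msg), collapse = "; "))
  }
  out$res
}

# Mock objects with the three documented components
ok  <- list(res = data.frame(time = 2001:2002, IT = c(57.1, 57.9)),
            msg = NULL, err = NULL)
bad <- list(res = NULL, msg = NULL, err = "Error: timeName variable absent.")

get_res(ok)
failed <- inherits(try(get_res(bad), silent = TRUE), "try-error")
failed
```

Checking the error component first avoids silently working with an empty result.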

The R packages used in this vignette are:

require(convergEU)
require(ggplot2)
require(dplyr)
require(tidyverse)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable) 
require(caTools)


Loading and preparing data

This vignette presents how to work with two types of data sources: data produced by Eurofound that are available without an active internet connection, and Eurostat data that can be downloaded with an active internet connection.

Locally accessible datasets: Eurofound data

Some datasets are accessible from the convergEU package using the R function data(). The code below loads a dataset with employment rates for the EU Member States.

library(convergEU)
data("emp_20_64_MS",package = "convergEU")
head(emp_20_64_MS)

The Eurofound datasets EWCS (European Working Conditions Survey) and EQLS (European Quality of Life Survey) are locally available within convergEU, see:

data(package = "convergEU")

The object dbEUF2018meta contains a description of data that is available from within the convergEU package, i.e. it does not need to be downloaded from the Eurofound website.

print(dbEUF2018meta, n=200,width=200)             

The raw local Eurofound database is accessed as follows:

require(convergEU)
data(dbEurofound)
head(dbEurofound)

...where variable names are:

names(dbEurofound)

and the time ranges are:

c(min(dbEurofound$time), max(dbEurofound$time))

Remember that the databases will likely have missing data.

NOTE: Eurofound data are stored statically within the convergEU package. To have the most recent version of Eurofound data, please update the package.

The Eurofound dataset used during the development of this package is emp_20_64_MS, the tidy version of the dataset emp_20_64. It contains employment rates for people aged 20 to 64 and can always be used for analysis without any further processing.

help(emp_20_64_MS)

All other locally accessible Eurofound indicators can be extracted from the Eurofound database as the first step of an analysis: this is the data preparation step. The user needs only to choose a time interval, an indicator and a set of countries (MS, Member States). For example:

convergEU_glb()$EU12

among those available:

names(convergEU_glb())[c(3:8)]

Remember that "EA", "EA19" and "Euro area" are synonyms.

convergEU_glb()$EA

As an example, here is how one would select the "lifesatisf" indicator taken from the column "Code_in_database" within the object dbEUF2018meta that contains the meta information:

head(dbEUF2018meta)
myTB <- extract_indicator_EUF(
    indicator_code = "lifesatisf", #Code_in_database
    fromTime=2003,
    toTime=2016,
    gender= c("Total","Females","Males")[2],
    countries= convergEU_glb()$EU12$memberStates$codeMS
    )

myTB

which results in three components: a dataset ready for further analyses ("\$res"); possible messages for the user ("\$msg") and, if an error occurs, a string (i.e. text) containing error messages ("\$err").

Downloadable data: Eurostat repository

Eurostat data can be downloaded directly from Eurostat via an active internet connection.

The heterogeneous structure of the indicators stored in the database requires some care. Sometimes a list of covariates is available for the indicator (gender, age, poverty status, etc.), and their values must be set to produce a tidy dataset (i.e. a table containing only time by countries).

First, raw data may be downloaded using the option rawDump=T:

ddTB1 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= c(NA,"T","F","M")[2],#c("Total","Females","Males")
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=T )
ddTB1
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB1,file=file.path(sourceFile1,"ddTB1.RData"))
load(file.path(sourceFile1,"ddTB1.RData"))
ddTB1

This results in a dataset which is not in the tidy format.

Note that unit and isced11 are auxiliary variables specific to this indicator and that some filtering must be performed to obtain a tidy dataset of years by countries. The argument rawDump=F filters and reshapes the bulk data as shown below, where convergEU_glb()\$EU28\$memberStates\$codeMS is a vector of strings for the considered countries, convergEU_glb()\$metaEUStat\$selectorUser[1] contains the name of the indicator and ageInterv can be used to specify an age interval for a given indicator:

ddTB2 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= c(NA,"T","F","M")[1],#c("Total","Females","Males")
      ageInterv = NA,
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=F,
      uniqueIdentif = 1)

convergEU_glb()$metaEUStat$selectorUser[1]
ddTB2
convergEU_glb()$metaEUStat$selectorUser[1]
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB2,file=file.path(sourceFile1,"ddTB2.RData"))
load(file.path(sourceFile1,"ddTB2.RData"))
ddTB2

The result is a list with the following components:

It is therefore possible to call the same function several times and specify the argument uniqueIdentif as an integer among those in the first column left of \$msg\$Further_Conditioning\$available_seleTagLs to obtain the same indicator under different scales and contexts. For example, here is how one would apply the fifth conditioning option that selects for males in the age interval "Y15-64":

ddTB3 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= "M",
      ageInterv = "Y15-64",
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=F,
      uniqueIdentif = 5)
ddTB3
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB3,file=file.path(sourceFile1,"ddTB3.RData"))
load(file.path(sourceFile1,"ddTB3.RData"))
ddTB3


Data preparation: from data structure to imputation of missing values

The analysis of convergence is performed on clean and imputed data that takes the shape of a tidy dataset with time by countries. The first step after downloading data is the description of the main features of such a dataset.

An illustrative example follows with the indicator "JQIintensity_i":

# print(dbEUF2018meta[11,],n=20,width=100)
t(dbEUF2018meta[11,])

First, the raw dataset is extracted:

ddTB4 <- extract_indicator_EUF(
    indicator_code = "JQIintensity_i", #Code_in_database
    fromTime= 1965,
    toTime=2016,
    gender= c("Total","Females","Males")[1],
    countries= convergEU_glb()$EU28$memberStates$codeMS
    )
print(ddTB4$res,n=35,width=250)
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB4,file=file.path(sourceFile1,"ddTB4.RData"))
load(file.path(sourceFile1,"ddTB4.RData"))
print(ddTB4$res,n=35,width=250)

...where error messages are not shown (the \$err list component is empty).

The print output tells us that missing values are present. First, it is convenient to assign a meaningful name to the downloaded data:

JQIinte <- ddTB4$res 

More features may be investigated as usual with common R functions, like the dimension of the dataset:

dim(JQIinte)

... that is, $5$ rows and $30$ columns. Here's how to check variable names:

names(JQIinte)

A dataset can't have qualitative variables, vectors of strings, or missing values for computing convergence measures. A time variable should also be present, and if its name is not "time", then the name must be passed as an argument during function calls so that the data are processed properly. The check_data() function checks for unsuitable features that must be resolved before starting the analysis. The object returned states whether the dataset is ready for calculations and, if it is not, the error component states why the check failed:

* "Error: one or more missing values in the dataframe."
* "Error: qualitative variables in the dataframe."
* "Error: string variables in the dataframe."
* "Error: timeName variable absent."
* "Error: time variable is not ordered."

For example, with the JQIinte data we have:

check_data(JQIinte,timeName="time")

Missing values are present, thus imputation is required. This can be done by using the impute_dataset function:

JQIinteImp <- impute_dataset(JQIinte, timeName = "time",
                          countries=convergEU_glb()$EU28$memberStates$codeMS,
                          tailMiss = c("cut", "constant")[2],
                          headMiss = c("cut", "constant")[2])$res 
print(JQIinteImp,n=35,width=250)

\$res was appended to the function call in order to keep only the tibble containing years by countries. The imputation selected for missing values at the start and at the end of the series is "constant": the nearest non-missing value is propagated to the missing years. Alternatively, all years in which one or more values are missing can be cut with the arguments:

tailMiss = "cut",headMiss = "cut"

The following code can be used to determine which years a country (here HR) is missing values for:

select(filter(JQIinte, is.na(HR)),time,HR) 

This code does the same thing by using pipes (and the magrittr package):

JQIinte %>% 
  filter(is.na(HR)) %>%
  select(time,HR)

The check_data function is called again but on imputed data:

check_data(JQIinteImp)

...where the suspected string variable is sex:

JQIinteFin <- dplyr::select(JQIinteImp,-sex)
check_data(JQIinteFin)

...and thus JQIinteFin is the final object to start the data analysis.

The tidyverse functions mutate, select, filter are extremely useful for further selection and inspection, and can be used with or without the forward-pipe operator (%>%).

Note that before starting the analysis, the number of digits may be selected. For example, the above tibble can be rounded to integers by invoking the function round() where digits = 0:

JQIinteFin[, -1] <- round(select(JQIinteFin,- time), digits = 0)
JQIinteFin

Imputing missing values using a straight line

The basic imputation method is deterministic. When a single year is missing, the imputed value is the average of the two observed values on either side. If several consecutive values are missing, the indicator is assumed to change linearly over time between the two observed time points flanking the run of missing values.

intervalTime <-  c(1999,2000,2001,2002,2003) 
intervalMeasure <- c( 66.5, NA,NA,NA,87.2) 
currentData <- tibble(time= intervalTime, veval= intervalMeasure) 
currentData
resImputed <- impute_dataset(currentData,
                           countries = "veval",
                           timeName = "time",
                           tailMiss = c("cut", "constant")[2],
                           headMiss = c("cut", "constant")[2]) 
resImputed$res  
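
The linear assumption can be reproduced by hand as a cross-check. The sketch below uses stats::approx() from base R on the same toy series. It illustrates the idea and is not the code used inside impute_dataset():

```r
intervalTime    <- c(1999, 2000, 2001, 2002, 2003)
intervalMeasure <- c(66.5, NA, NA, NA, 87.2)

observed <- !is.na(intervalMeasure)

# Linear interpolation between the observed points flanking the gap
byHand <- approx(x = intervalTime[observed],
                 y = intervalMeasure[observed],
                 xout = intervalTime)$y
byHand
```

Each missing year changes by (87.2 - 66.5) / 4 = 5.175 with respect to the previous one.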

In the figure below, grey points are imputed using observed values represented by solid blue points:

tmp <-  as.data.frame(currentData[ c(1,5),] )
tmp2 <- as.data.frame(resImputed$res[2:4,] )

myg <- ggplot(as.data.frame(resImputed$res),  mapping=aes(x=time,y=veval)) + 
  geom_point() + 
  geom_line(data=resImputed$res,col="red") + 
  geom_point(data=tmp,mapping=aes(x=time,y=veval), 
              size=4, 
              colour="blue")  + 
  geom_point(data= tmp2, 
             aes(x=time,y=veval),size=4,alpha=1/3,col="black") + 
  xlab("Time") + ylab("Measure / Index") +  
  ggtitle("Blue points are observed values,\n grey points are imputed values\n")

myg 

It is important to note that $10\%$ of missing values for a country may already require substantive justification before interpreting results after imputation. It is up to the user whether they proceed with the analysis after checking the missingness of their data.

Weighted average smoothing of a complete dataset

This section focuses on smoothing values observed in the considered time interval. A smoothing procedure removes sudden large changes, showing a less variable time series than the original. Given that short time series (panel data) are considered here, a three-point weighted average is proposed. The smoothing procedure substitutes an original raw value $y_{m,i,t}$ of country $m$ indicator $i$ at time $t$ with the weighted average $$\check{y}_{m,i,t} = y_{m,i,t-1} ~ (1-w)/2 + w ~ y_{m,i,t} + y_{m,i,t+1} ~ (1-w)/2$$ where $0< w \leq 1$. The special case $w=1$ corresponds to no smoothing. In case of missing values, an NA is returned. If the weight is outside the interval $(0,1]$, an NA is returned. The first and last values are smoothed using weights $w$ and $1-w$.
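
The formula can be sketched in a few lines of base R. The helper smooth3() is hypothetical and is not the implementation used by the package. Returning a vector of NA for the whole series when any value is missing is a simplifying assumption:

```r
# Hypothetical helper, not the package implementation
smooth3 <- function(y, w) {
  n <- length(y)
  # Invalid weight or missing data: return NA (simplifying assumption)
  if (is.na(w) || w <= 0 || w > 1 || anyNA(y)) return(rep(NA_real_, n))
  if (n < 2) return(y)
  out  <- numeric(n)
  side <- (1 - w) / 2
  if (n > 2) {
    idx <- 2:(n - 1)
    # Interior points: weights (1 - w) / 2, w, (1 - w) / 2
    out[idx] <- side * y[idx - 1] + w * y[idx] + side * y[idx + 1]
  }
  # End points have one neighbour only: weights w and 1 - w
  out[1] <- w * y[1] + (1 - w) * y[2]
  out[n] <- w * y[n] + (1 - w) * y[n - 1]
  out
}

smooth3(c(1, 2, 3, 4), w = 0.5)
smooth3(c(1, 2, 3, 4), w = 1)    # no smoothing
```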

After loading the data, imputation and smoothing should be performed. The example below with countries IT and DE illustrates how to do so. First, check if missing values are present:

workTB <- dplyr::select(emp_20_64_MS, time, IT,DE)
check_data(workTB)

If the dataframe passes the check, proceed to the smoothing step. The time variable is removed from the data to be smoothed and passed separately through the argument timeTB:

resSM <- smoo_dataset(select(workTB,-time), leadW = 0.149, timeTB= select(workTB,time))
resSM

For a comparison:

compaTB<-bind_cols(workTB, select(resSM,-time))
names(compaTB)<-c("time", "IT","IT1","DE", "DE1")
compaTB

A graphical output shows changes for "IT", with the original index in blue and the smoothed index in red:

qplot(time,IT,data=compaTB) + 
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=IT1),colour="red") +
  geom_point(aes(x=time,y=IT1),colour="red",shape=8)

Similarly, for DE:

qplot(time,DE,data=compaTB) + 
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=DE1),colour="red") +
  geom_point(aes(x=time,y=DE1),colour="red",shape=8)

A weight equal to 1 leaves the data unchanged:

resSM <- smoo_dataset(select(workTB,-time), leadW = 1, timeTB= select(workTB,time))
compaTB <- bind_cols(workTB, select(resSM,-time))
names(compaTB)<-c("time", "IT","IT1","DE", "DE1")
qplot(time,IT,data=compaTB) + 
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=IT1),colour="red") +
  geom_point(aes(x=time,y=IT1),colour="red",shape=8)

A time window larger than $3$ could be used, but bear in mind that profound economic and social changes can occur in the EU within $3$ years.

Moving average smoother

Several alternative smoothing algorithms are available in R. Classical moving average (MA) smoothers are provided by the caTools package.

The emp_20_64_MS dataset is used again, with Italy as the example country.

data(emp_20_64_MS)
cuTB <- select(emp_20_64_MS,time)
cuTB <- mutate(cuTB,ITori =emp_20_64_MS$IT)

At the beginning and end of this series values are averages calculated on a smaller and smaller number of observations (tails):

cuTB <- mutate(cuTB, IT_k_3= runmean(emp_20_64_MS$IT, k=3, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

cuTB <-  mutate(cuTB, IT_k_5= runmean(emp_20_64_MS$IT, k=5, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

cuTB <-  mutate(cuTB, IT_k_7= runmean(emp_20_64_MS$IT, k=7, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

The options alg, endrule and align of the runmean function are described in the caTools documentation.

The figure below shows results for different degrees of smoothing: original (black), k=3 (red), k=5 (blue), k=7 (orange).

myG <- ggplot(cuTB,aes(x=time,y=ITori))+geom_line()+geom_point()+
       geom_line(aes(x=time,y=IT_k_3),colour="red")+
       geom_point(aes(x=time,y=IT_k_3),colour="red")+
       #
       geom_line(aes(x=time,y=IT_k_5),colour="blue")+
       geom_point(aes(x=time,y=IT_k_5),colour="blue")+
       #
       geom_line(aes(x=time,y=IT_k_7),colour="orange")+
       geom_point(aes(x=time,y=IT_k_7),colour="orange")

myG

The time series is typically so short that, at $k=7$, many values are smoothed using a different number of observations (fewer at the start and at the end).

The above calculations can be performed as shown below in the convergEU package:

cuTB <-  emp_20_64_MS[,c("time","IT","DE")]
ma_dataset(cuTB, kappa=3, timeName= "time")

This approach is a bit less customizable, but it leads to a standard tidy dataset.
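
For comparison, a centred moving average can also be sketched with stats::filter() from base R. This is an illustration on a toy series, not the implementation of ma_dataset(), and how ma_dataset() treats the ends of the series is not assumed here. In this sketch the ends are left as NA:

```r
x <- c(1, 2, 3, 4, 5)   # toy series
k <- 3                  # odd window width

# stats::filter is called explicitly because dplyr masks filter()
ma3 <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
ma3
```

The first and last values are NA because a full window of three observations is not available there.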


Absolute change

Absolute change for country $m$, indicator $i$ at time $t$ is defined as: $$ \Delta y_{m,i,t} = y_{m,i,t} - y_{m,i,t-1} $$

This can be calculated with the function abso_change() of the convergEU package. For example, with the emp_20_64_MS dataset, which is tidy and has no missing values:

data(emp_20_64_MS)
mySTB <- abso_change(emp_20_64_MS, 
                        time_0 = 2005, 
                        time_t = 2010,
                        all_within=TRUE,
                        timeName = "time")
names(mySTB$res)

The equation above results in:

mySTB$res$abso_change

If desired, fewer digits may be displayed. For instance, by rounding we get:

round(dplyr::select(mySTB$res$abso_change,AT:UK), 5)

The sum of absolute values $$ \sum_{t=t_0+1}^{T} | \Delta y_{m,i,t}| $$ over the years from $t_0$ (time_0) to $T$ (time_t) is:

round(mySTB$res$sum_abs_change,4)

This sum can be divided by the number of pairs of years so that the result is an average per pair of years:

round(mySTB$res$average_abs_change,4)
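
The three quantities can be checked by hand on a toy series. The sketch below uses base R only, with made-up values for a single country observed in consecutive years:

```r
y <- c(60, 62, 61, 65)               # toy values, consecutive years

deltaY <- diff(y)                    # y_t - y_{t-1}
sumAbs <- sum(abs(deltaY))           # sum of absolute changes
avgAbs <- sumAbs / length(deltaY)    # average per pair of years

deltaY
sumAbs
avgAbs
```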


Summaries and clusters of countries

The unweighted average of country values is an important summary statistic. The possible sets of countries that can be specified are stored within the function generating global static objects and tables, called convergEU_glb(). Below we showcase how to use this function with the emp_20_64_MS dataset.

First note that the EU area includes the following MS:

convergEU_glb()$Eurozone
convergEU_glb()$EU19

The labels for the 28 MS are:

convergEU_glb()$EU28

The list of known MS labels is:

names(convergEU_glb())[3:7]

For example, the unweighted average in the emp_20_64_MS dataset for the EU28 would be:

average_clust(emp_20_64_MS, 
              timeName = "time",
              cluster = "EU28")$res[,c(1,30)]

while for EU12 it would be:

average_clust(emp_20_64_MS,
              timeName = "time",
              cluster = "EU12")$res[,c(1,30)]
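
The unweighted average is simply the row mean over the countries of the chosen set. A base R sketch on toy data follows. The values and the two-country cluster are hypothetical:

```r
toyTB <- data.frame(time = 2001:2003,
                    IT = c(57, 58, 59),
                    DE = c(68, 69, 70),
                    FR = c(67, 67, 68))

myCluster <- c("IT", "DE")           # hypothetical set of countries

# Unweighted mean over the selected countries, one value per year
toyTB$clusterMean <- rowMeans(toyTB[, myCluster])
toyTB
```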

An unknown label, like "EUspirit", would cause a computational error:

average_clust(emp_20_64_MS,timeName = "time",cluster = "EUspirit")

...as would an incorrect time name:

average_clust(emp_20_64_MS,timeName = "TTime",cluster = "EA")

The time series can be plotted as shown:

wwTB <- average_clust(emp_20_64_MS,timeName = "time",cluster = "EU28")$res[,c(1,30)]
mini_EU <- min(wwTB$EU28)
maxi_EU <- max(wwTB$EU28)

qplot(time, EU28, data=wwTB,
      ylim=c(mini_EU,maxi_EU))+geom_line(colour="navy blue")+
      ylab("emp_20_64")


The analysis of convergence

Several measures of convergence have been recently proposed by Eurofound (Eurofound, 2018). In this section, each measure is introduced and its usage showcased.

Beta-convergence

Let's assume that we have a tidy dataset (tibble) in the form years by countries. The calculations for beta-convergence are performed according to the following linear model: $$ \tau^{-1}(\ln(y_{m,i,t+\tau})-\ln(y_{m,i,t})) = \beta_0 + \beta_1 \ln(y_{m,i,t}) +\epsilon_{m,i,t} $$ where $m$ represents an EU Member State (country), $i$ refers to an indicator of interest, $t$ is the reference time and $\tau \in \{1,2,\ldots\}$ the length of the time window (typically $1$ or more years).

The output of beta_conv() is a list in which the transformed data, the point estimate of $\beta_1$ and a standard two-tailed test (including the p-value and adjusted R squared) are reported. While it is not implemented in the package, a one-tailed test of $H_0: \beta_1 \geq 0$ against $H_1: \beta_1 < 0$ may also be used. In the implementation of the function beta_conv(), the same reference time is maintained across different years. The division of the left-hand side by the amount of time elapsed can be skipped by passing the argument useTau = FALSE.

Below is an example on how to invoke the function:

require(ggplot2)
require(dplyr)
require(tibble)

empBC <- beta_conv(emp_20_64_MS, 
                 time_0 = 2002, 
                 time_t = 2006, 
                 all_within = FALSE, 
                 timeName = "time")
empBC

Note that all_within = FALSE is the default.
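
To make the model concrete, the regression can be sketched by hand with lm() on toy data. The four countries and their values are made up. The column names indic and deltaIndic mirror those of the workTB component used in this section, and the sketch is not the internal code of beta_conv():

```r
# Toy levels for four hypothetical countries at the start and end year
y0  <- c(A = 50, B = 60, C = 70, D = 80)
yt  <- c(A = 60, B = 66, C = 72, D = 78)
tau <- 4                                  # years elapsed

toyBC <- data.frame(indic      = log(y0),
                    deltaIndic = (log(yt) - log(y0)) / tau)

# Ordinary least squares fit of the linear model above
fit   <- lm(deltaIndic ~ indic, data = toyBC)
beta1 <- unname(coef(fit)["indic"])
beta1
summary(fit)$coefficients
```

In this toy example countries starting from lower levels grow faster, so the estimated slope is negative.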

A plot of transformed data and the regression line can be obtained by running:

qplot(empBC$res$workTB$indic,
      empBC$res$workTB$deltaIndic,
      xlab="log-Indicator",
      ylab="Delta-log-indicator") +
  geom_abline(intercept = as.numeric(empBC$res$summary[1,2]),
              slope = as.numeric(empBC$res$summary[2,2]),
              colour = "red") +
  geom_text(aes(label=empBC$res$workTB$countries),
            hjust=0, vjust=0,colour="blue")

Labels are replicated as many times as the number of included years if all_within=TRUE was specified. Furthermore, note that if the value of the indicator at the start or end time were 0, calculating beta convergence would be impossible (since the log of 0 is not defined). To bypass this and allow the calculation of beta-convergence, a very small constant (equal to a hundredth of the smallest value in the dataset) is added to the indicator where it equals 0. This allows the calculation of beta convergence and should not affect the outcome of the analysis.


Sigma-convergence

The key concept in sigma-convergence is variability with respect to the mean. Let $Y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t$, and $\overline{Y}_{A,i,t}$ the average over aggregation $A$. If $A = EU28$, then:
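
A sketch of the usual summaries, written for a generic aggregation $A$ with $|A|$ member states ($|A| = 28$ for the EU28), is the unweighted mean, the standard deviation and the coefficient of variation. The divisor $|A|$ in the standard deviation, rather than $|A|-1$, is an assumption and is not taken from the package:

$$ \overline{Y}_{A,i,t} = \frac{1}{|A|}\sum_{m \in A} Y_{m,i,t}, \qquad s_{A,i,t} = \sqrt{\frac{1}{|A|}\sum_{m \in A} (Y_{m,i,t} - \overline{Y}_{A,i,t})^2}, \qquad CV_{A,i,t} = \frac{s_{A,i,t}}{\overline{Y}_{A,i,t}} $$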

For each year, the summaries above are calculated to let the user see if convergence (i.e., a reduction in heterogeneity) took place. Below is how to test for sigma convergence across the entire time interval in the dataset:

mySTB <- sigma_conv(emp_20_64_MS, timeName="time")
mySTB

It is also possible to specify a time interval:

sigma_conv(emp_20_64_MS, time_0 = 2002, time_t = 2004)

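The departure from the mean can be characterized as shown below, where the labels $-1$, $0$ and $1$ indicate values that are, respectively, more than one standard deviation below the mean, within one standard deviation of the mean, and more than one standard deviation above the mean: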

res <- departure_mean(oriTB = emp_20_64_MS, sigmaTB = mySTB$res)
names(res$res)
res$res$departures

The contribution of each MS to the variance at a given time $t$ is evaluated by the square of the difference $(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2$ between the indicator $i$ of country $m$ at time $t$ and the unweighted average over the member states. These contributions can be obtained by running:

res$res$squaredContrib

It is also possible to decompose the numerator of the variance, called deviance, at each time in order to calculate the percentage contributed by each MS $m$ to the total deviance for indicator $i$ at time $t$: $$ 100 \cdot \frac{(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2}{\sum_{m} (Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2} $$

res$res$devianceContrib

Notice that each row adds to $100$.
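
The decomposition can be verified by hand on toy data with base R. The three countries and their values are made up, and the sketch is not the internal code of departure_mean():

```r
toyTB <- data.frame(time = 2001:2002,
                    IT = c(57, 58),
                    DE = c(68, 69),
                    FR = c(67, 66))
vals <- as.matrix(toyTB[, -1])

# Squared departures from the unweighted yearly mean
sqDev <- (vals - rowMeans(vals))^2

# Percentage of the yearly deviance contributed by each country
pctContrib <- 100 * sqDev / rowSums(sqDev)
pctContrib
rowSums(pctContrib)
```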

It is possible to produce a graphical output about the main features of a country's time series, as shown below:

myGG <- graph_departure(res$res$departures,
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 4,
                myfont_scale = 1.35,
                x_angle = 45,
                color_rect = c("-1"='red1', "0"='gray80',"1"='lightskyblue1'),
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.9
                )
myGG

Any selection of countries is feasible:

#myWW1<- warnings()
myGG <- graph_departure(res$res$departures[,1:10],
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 4,
                myfont_scale = 1.35,
                x_angle = 45,
                color_rect = c("-1"='red1', "0"='gray80',"1"='lightskyblue1'),
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.29
                )

myGG

Gamma-convergence

We now introduce gamma convergence by an index based on ranks.

Let $y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t=0,1,\ldots, T$, and $\{ \tilde{y}_{m,i,t}: m \in A \}$ the ranks for indicator $i$ over member states in the reference set $A$, for example $A = EU28$, at a given time $t$. The sum of ranks within member state $m$ is: $$ \tilde{y}^{(s)}_{m,i} = \sum_{t=0}^T \tilde{y}_{m,i,t} $$ The variance of the sum of ranks over the given interval $$ Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] $$ may be compared to the variance of ranks in the reference time $t=0$: $$ Var\left[ \{\tilde{y}_{m,i,0}: m \in A \} \right] $$

The Kendall index KI, with respect to cluster $A$ of member states for the indicator $i$ over a given time interval is: $$ KI(A,i,T) = \frac{Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] }{ (T+1)^2 ~~Var\left[\{\tilde{y}_{m,i,0}: m \in A \}\right] } $$

The measure of gamma-convergence is obtained with the following function:

gamma_conv(emp_20_64_MS,last=2016,ref=2002,timeName="time")
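
The index can be computed by hand on toy data to see how the formula works. The sketch below uses base R, made-up values for four hypothetical countries, and average ranks for ties (an assumption). It is not the internal code of gamma_conv():

```r
# Toy data: rows are times t = 0, 1, 2; columns are hypothetical countries
toyY <- rbind(c(A = 50, B = 60, C = 70, D = 80),
              c(55, 61, 69, 82),
              c(58, 63, 71, 85))

# Ranks across countries within each time
rk     <- t(apply(toyY, 1, rank))
sumRk  <- colSums(rk)          # sum of ranks within each country
nTimes <- nrow(toyY)           # T + 1

KI <- var(sumRk) / (nTimes^2 * var(rk[1, ]))
KI
```

Since the ranking of countries never changes in this toy example, the index equals 1. Changes in the ranking over time would reduce it.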

Delta-convergence

Let $y_{m,i,t}$ be the value of indicator $i$ for MS $m$ at time $t$, and $y^{(M)}_{i,t}$ the maximum value over member states in the reference set $A$ (e.g., $A = EU28$): $$ y^{(M)}_{i,t} = \max(\{ y_{m,i,t}: m \in A\}) $$

The distance of MS $m$ from the top performer at time $t$ is: $$ y^{(M)}_{i,t} - y_{m,i,t} $$ The overall distance at time $t$, called delta, is the sum of distances over the reference set $A$ for the considered indicator $i$: $$ \delta_{i,t} = \sum_{m \in A} (y^{(M)}_{i,t} - y_{m,i,t}) $$

Delta-convergence can be calculated as follows:

delta_conv(emp_20_64_MS,"time")
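
The definition of delta can be reproduced by hand on toy data with base R. The values below are made up, and the sketch is not the internal code of delta_conv():

```r
toyTB <- data.frame(time = 2001:2002,
                    IT = c(60, 62),
                    DE = c(70, 71),
                    FR = c(65, 71))
vals <- as.matrix(toyTB[, -1])

top    <- apply(vals, 1, max)      # top performer in each year
deltaT <- rowSums(top - vals)      # sum of distances from the top

data.frame(time = toyTB$time, delta = deltaT)
```

In 2001 the distances from the maximum (70) are 10, 0 and 5, so delta is 15.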


Automated production of scoreboard and fiches

The convergEU package allows the user to produce scoreboards and fiches as HTML or PDF files in an automated way.


Scoreboards

Scoreboards showcase the raw values of an indicator (level, $y_{m,i,t}$) for MS $m$ at time $t$ for indicator $i$. The difference between years, i.e. the change: $$ y_{m,i,t} - y_{m,i,t-1} $$ can be calculated by following the example below.

We can produce the scoreboard for the dataset emp_20_64_MS with the following:

data(emp_20_64_MS)
resTB <- scoreb_yrs(emp_20_64_MS,timeName = "time")
resTB

The result is a list with three components: the summary statistics, the numerical labels indicating the interval of the partition a level belongs to, and the interval of the partition a change belongs to.

Numerical labels are assigned as follows, where $m$ is the mean and $s$ the standard deviation of the levels or changes (see the Draft Joint Employment Report from the Commission and the Council, 2019):

* value $1$ if the original level or change is $y \leq m -1 \cdot s$;
* value $2$ if the original level or change is $m -1\cdot s < y \leq m - 0.5\cdot s$;
* value $3$ if the original level or change is $m - 0.5\cdot s< y \leq m +0.5\cdot s$;
* value $4$ if the original level or change is $m +0.5\cdot s< y \leq m + 1\cdot s$;
* value $5$ if the original level or change is $y > m +1\cdot s$.
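
The partition above can be sketched with cut() from base R. The helper score_label() is hypothetical, it is not the implementation of scoreb_yrs(), and the values of $m$ and $s$ are supplied by the caller:

```r
# Hypothetical helper, not the package implementation; requires s > 0
score_label <- function(y, m, s) {
  breaks <- c(-Inf, m - s, m - 0.5 * s, m + 0.5 * s, m + s, Inf)
  # right = TRUE gives intervals of the form (a, b], matching the rules above
  cut(y, breaks = breaks, labels = FALSE, right = TRUE)
}

toyY <- c(7, 8, 8.5, 9, 10, 11, 11.5, 12, 13)
score_label(toyY, m = 10, s = 2)
```

With $m = 10$ and $s = 2$ the thresholds are 8, 9, 11 and 12, so a value exactly on a threshold (for example 8) falls in the lower interval.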

We note that there is the possibility of representing the above summaries as coloured plots (TO DO) into scoreboards.

For the comparison of a country with the EU average, the following steps are recommended, from raw data:

# require(ggplot2)
# data(emp_20_64_MS)
selectedCountry <- "IT"
timeName <-  "time"
myx_angle <-  45

outSig <- sigma_conv(emp_20_64_MS, timeName = timeName,
           time_0=2002,time_t=2016)
miniY <- min(emp_20_64_MS[,- which(names(emp_20_64_MS) == timeName )])
maxiY <-  max(emp_20_64_MS[,- which(names(emp_20_64_MS) == timeName )])
estrattore<-  emp_20_64_MS[,timeName] >= 2002  &  emp_20_64_MS[,timeName] <= 2016
ttmp <- cbind(outSig$res, dplyr::select(emp_20_64_MS[estrattore,], -contains(timeName)))

myG2 <- 
  ggplot(ttmp) + ggtitle(
  paste("EU average (black, solid) and country",selectedCountry ," (red, dotted)") )+
  geom_line(aes(x=ttmp[,timeName], y =ttmp[,"mean"]),colour="black") +
  geom_point(aes(x=ttmp[,timeName],y =ttmp[,"mean"]),colour="black") +
#        geom_line()+geom_point()+
    ylim(c(miniY,maxiY)) + xlab("Year") +ylab("Indicator") +
  theme(legend.position = "none")+
  # add countries
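  # colour is a fixed attribute, so it is set outside aes(); the country is taken from selectedCountry
  geom_line( aes(x=ttmp[,timeName], y = ttmp[,selectedCountry]),colour="red",linetype="dotted") + 
  geom_point( aes(x=ttmp[,timeName], y = ttmp[,selectedCountry]),colour="red") +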
  ggplot2::scale_x_continuous(breaks = ttmp[,timeName],
                     labels = ttmp[,timeName]) +
   ggplot2::theme(
         axis.text.x=ggplot2::element_text(
         #size = ggplot2::rel(myfont_scale ),
         angle = myx_angle 
         #vjust = 1,
         #hjust=1
         ))

myG2

It is also possible to graphically show departures in terms of the above defined partition:

obe_lvl <- scoreb_yrs(emp_20_64_MS,timeName = timeName)$res$sco_level_num
# select subset of time
estrattore <- obe_lvl[,timeName] >= 2009 & obe_lvl[,timeName] <= 2016  
scobelvl <- obe_lvl[estrattore,]

my_MSstd <- ms_dynam(scobelvl,
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 3,
                myfont_scale = 1.35,
                x_angle = 45,
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.9
                )   

my_MSstd


Country fiches

The convergEU package provides a function that automatically prepares one or more country fiches. Its arguments select the indicator and the countries to report on: the user specifies one key country and may list further countries of interest to compare performances. Note that most arguments should be passed as strings instead of object names and that the dataset must be a complete tibble (without missing values). An Internet connection should be available when invoking the function so that the results render properly.
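Before invoking the function, the requirements on the dataset can be checked with a few lines of base R. The sketch below is an assumption-laden illustration: `check_fiche_input()` is a hypothetical helper that is not part of convergEU, and the small data frame is made up. The `envir` argument exists only so that the example does not touch the global environment.

```r
# Hypothetical pre-flight check, not part of convergEU.
# dfName: name of the dataset, given as a string
# envir:  environment in which to look the object up
check_fiche_input <- function(dfName, envir = globalenv()) {
  stopifnot(is.character(dfName), length(dfName) == 1)
  found <- exists(dfName, envir = envir)
  obj <- if (found) get(dfName, envir = envir) else NULL
  list(
    found     = found,                          # does an object with that name exist?
    is_tibble = inherits(obj, "tbl_df"),        # tibbles carry the class "tbl_df"
    complete  = found && !any(is.na(obj))       # TRUE only if there are no missing values
  )
}

# Made-up data with one missing value, stored in a scratch environment
demo_env <- new.env()
assign("myTB",
       data.frame(time = 2002:2004, IT = c(55.1, NA, 56.3)),
       envir = demo_env)
chk <- check_fiche_input("myTB", envir = demo_env)
chk
```

Here the object is found, but it is a plain data frame with a missing value, so it would need to be converted to a tibble and completed before use.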

Below is an example call to go_ms_fi() that illustrates the syntax. This command would create a country fiche for Germany, comparing it with Italy, the UK and France over the period 2002-2016.

go_ms_fi(
    workDF ='myTB',
    countryRef ='DE',
    otherCountries = "c('IT','UK','FR')",
    time_0 = 2002,
    time_t = 2016,
    tName = 'time',
    indiType = "highBest",
    aggregation = 'EU27',
    x_angle = 45,
    dataNow = Sys.time(),
    author = 'A.Student',
    outFile = 'Germany-up2-2016', 
    outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish",
    indiName = 'emp_20_64_MS',
    workTB = NULL
)

Here is a breakdown of the arguments passed to the function:

* workDF is a string specifying the name of the working dataset, which must be available in the global environment.
* countryRef is a string giving the short name of the country (or unit) of main interest. This country will be shown in one-country plots.
* otherCountries specifies other countries that should be included in the analysis for comparison.
* time_0 specifies the starting time.
* time_t specifies the end time.
* tName is the name of the variable containing times.
* indiType specifies whether the indicator is of type "lowBest" or "highBest" (i.e. whether a low or a high value is desirable for a country).
* aggregation specifies the reference group of EU countries (e.g., 'EU27', 'EA'). If using a dataset with units other than EU Member States, then this should be 'custom'.
* x_angle determines the axis orientation for time labels in graphs.
* dataNow specifies the date of creation of the country fiche. Sys.time() provides the exact date and time at which the function was run.
* author specifies the author of the report, which will be shown in the fiche.
* outFile is a string determining the name of the output file. This should not include a path.
* outDir determines the output directory, which need not exist yet (only one level can be created).
* indiName is a string determining how the name of the considered indicator will appear in the fiche.
* workTB is the name of a tibble containing data; it is optional and serves as an alternative to a global object.
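In the example call, otherCountries is passed as the string "c('IT','UK','FR')", i.e. R code wrapped in quotes. A minimal base R sketch for building such a string from an ordinary character vector follows. The helper `as_code_string()` is hypothetical and not part of convergEU; the round trip through `parse()` only shows that the string is valid R code and is not a statement about how the package processes it.

```r
# Hypothetical helper, not part of convergEU.
# Turns c("IT", "UK") into the string "c('IT','UK')".
as_code_string <- function(x) {
  paste0("c(", paste0("'", x, "'", collapse = ","), ")")
}

others <- as_code_string(c("IT", "UK", "FR"))
others

# Round trip: evaluating the string gives back the character vector
eval(parse(text = others))
```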

Of particular importance is the argument outFile, a string indicating the name of the output file. Similarly, outDir is the path (drive and folders) in which the final compiled HTML will be stored. The syntax of the path depends on the operating system; for example, outDir='F:/analysis/IT2018' indicates that, on the USB disk called 'F', the folder 'analysis' contains the folder 'IT2018', where R will write the country fiche. Note that a disk called 'F' must exist and the folder 'analysis' must exist on that disk, whereas the folder 'IT2018' is created by the function if it does not already exist.
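Since only the last level of outDir may be missing, it can help to verify the parent folder before compiling a fiche. The following base R sketch uses a hypothetical helper, `parent_exists()`, which is not part of convergEU, and the session's temporary directory as a stand-in for a real output location.

```r
# Hypothetical helper, not part of convergEU.
# TRUE if the folder that should contain outDir already exists.
parent_exists <- function(outDir) dir.exists(dirname(outDir))

base <- tempdir()  # a folder that exists in every R session

# Parent exists; only the last level 'IT2018' would need to be created
ok <- parent_exists(file.path(base, "IT2018"))

# Two missing levels: 'no_such_folder' does not exist
bad <- parent_exists(file.path(base, "no_such_folder", "IT2018"))

c(ok = ok, bad = bad)
```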

Besides the compiled HTML, the output directory also receives a file named as specified in outFile with the string '-workspace.RData' appended. It contains the data and plots produced during the compilation of the country fiche, for subsequent use in other technical reports.
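The saved workspace can be loaded into its own environment so that its objects do not overwrite anything in the current session. The sketch below is a minimal illustration: `load_fiche_workspace()` is a hypothetical helper, not part of convergEU, and it assumes the file can be read with `load()`, as the '.RData' extension suggests. A stand-in file with a placeholder object is written to the temporary directory so that the example runs without a compiled fiche; the names of the objects in a real workspace are not documented here.

```r
# Hypothetical helper, not part of convergEU.
# Loads '<outFile>-workspace.RData' from outDir into a fresh environment.
load_fiche_workspace <- function(outDir, outFile) {
  path <- file.path(outDir, paste0(outFile, "-workspace.RData"))
  if (!file.exists(path)) stop("Workspace file not found: ", path)
  env <- new.env()
  load(path, envir = env)
  env
}

# Stand-in workspace file containing a placeholder object
demo_dir <- tempdir()
dummy_plot <- "placeholder"
save(dummy_plot,
     file = file.path(demo_dir, "Germany-up2-2016-workspace.RData"))

ws <- load_fiche_workspace(demo_dir, "Germany-up2-2016")
ls(ws)  # list the objects stored in the workspace
```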


Indicator fiches

The function go_indica_fi() allows the user to create indicator fiches in the form of an HTML or PDF file. Note that most arguments should be passed as strings instead of object names and that the dataset must be a complete tibble (without missing values). An Internet connection should be available when invoking the function so that the results render properly.

An example of syntax to invoke the procedure is:

go_indica_fi(
    time_0 = 2005,
    time_t = 2010,
    timeName = 'time',
    workDF = 'emp_20_64_MS' ,
    indicaT = 'emp_20_64',
    indiType = c('highBest','lowBest')[1],
    seleMeasure = 'all',
    seleAggre = 'EU28',
    x_angle =  45,
    data_res_download =  FALSE,
    auth = 'A.Student',
    dataNow =  '2019/05/16',
    outFile = "test_IT-emp_20_64_MS",
    outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish",
    pdf_out = FALSE,
    workTB = NULL,
    selfContained = FALSE,
    eige_layout = FALSE,
    memStates = 'quintiles' # ('quintiles', 'default', 'custom')
)

Here is a breakdown of the arguments passed to the function:

* time_0 specifies the starting time.
* time_t specifies the end time.
* timeName is the name of the variable containing times.
* workDF is a string specifying the name of the working dataset, which must be available in the global environment.
* indicaT is a string determining how the name of the considered indicator will appear in the fiche.
* indiType specifies whether the indicator is of type "lowBest" or "highBest" (i.e. whether a low or a high value is desirable for a country).
* seleMeasure determines which measures of convergence will be calculated. This is a subset of the following collection of strings: "all", "beta", "delta", "gamma", "sigma". If uncertain, we recommend using "all", as it is a shortcut for the whole set.
* seleAggre specifies the set of EU countries (e.g., 'EU27', 'EU19') for which the analysis should be run. If using a dataset with units other than EU Member States, then this should be 'custom'.
* x_angle determines the axis orientation for time labels in graphs. The default is 45.
* data_res_download determines whether the data and results should be downloaded. The default is FALSE.
* auth specifies the author of the report, which will be shown in the fiche. The default is 'A.Student'.
* dataNow specifies the date of creation of the indicator fiche. The default is the current date and time.
* outFile is a string determining the name of the output file. This should not include a path.
* outDir determines the output directory, which need not exist yet. Only one level can be created: the parent folder must already exist, and the final sub-folder is created if it is specified in the path and missing.
* pdf_out lets the user choose whether to create the fiche as an HTML or a PDF. The default is FALSE. If passed as TRUE, then the fiche will also be created as a PDF file.
* workTB is the name of a tibble containing data; it is optional and serves as an alternative to a global object.
* selfContained should be set to TRUE if just one file is desired.
* eige_layout should be set to TRUE if the EIGE (European Institute for Gender Equality) layout is desired.
* memStates determines what kinds of visualisations and analyses are included in the fiche when comparing units. There are three options: "default", "quintiles", and "custom".
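A value for seleMeasure can be checked against the strings listed above before the fiche is compiled. The sketch below is an illustration only: `expand_measures()` is a hypothetical helper and not part of convergEU, and how the package itself validates this argument is not described here.

```r
# Hypothetical helper, not part of convergEU.
# Checks seleMeasure and expands the shortcut "all" to the four measures.
expand_measures <- function(seleMeasure) {
  measures <- c("beta", "delta", "gamma", "sigma")
  unknown <- setdiff(seleMeasure, c("all", measures))
  if (length(unknown) > 0) {
    stop("Unknown measure(s): ", paste(unknown, collapse = ", "))
  }
  # "all" stands for the whole set; otherwise keep the requested measures
  if ("all" %in% seleMeasure) measures else intersect(measures, seleMeasure)
}

all_measures <- expand_measures("all")
two_measures <- expand_measures(c("sigma", "beta"))
```

A misspelled entry such as "bata" stops with an error instead of being silently ignored.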



References

Below the main references are listed:










convergEU documentation built on May 29, 2024, 11:15 a.m.