require(ggplot2)
require(dplyr)
require(tidyverse)
require(convergEU)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable) 
require(kableExtra)
require(caTools)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)




Introduction

The convergEU R package is a set of S3 functions and data objects suited for the analysis of economic and social convergence of Member States (MS) within the European Union (EU).

This vignette is intended to be a gentle introduction to the analysis of convergence performed with the convergEU suite of functions.

Data from Eurofound (local) and Eurostat (download) are available from within the package with little effort. Nevertheless, data objects may also be created from scratch, following the R syntax to create or import datasets.

A dataset created or processed in this package is almost always a tibble, therefore at least some confidence with the dplyr package (online dplyr site) is convenient, although in some cases a bit more of the tidyverse (https://www.tidyverse.org/) is exploited.
For a general introduction see "R for Data Science" (online R4DS) by Wickham and Grolemund.

Imported or downloaded data may or may not be in a tidy shape, here defined as a rectangular table with ordered years in the first column and with two or more MS in subsequent columns. Such a data structure may implicitly refer to males and/or a given age class, a fact that may or may not be evident through further columns in the tibble.

The most common data to process are indicators downloaded from EU repositories, thus it is worth remembering that an indicator may represent better performance of a country when it takes high values (highBetter) or when it takes low values (lowBetter). When this feature is critical for a function to process data properly, a specific argument must be provided, namely one string chosen between "highBetter" and "lowBetter".

One general piece of advice concerns the member states included in a dataset (tibble object) given as input to a function. A function that calculates a summary over a set of EU countries (like the average) must always receive indicators for all those countries in the input dataset, even when the interest is focused on just two or three countries among them.
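As a minimal sketch of the tidy shape described above, the following base R code builds a small years-by-countries table. The values and the object name toyTB are made up for illustration and are not package data.

```r
# Hypothetical toy data: ordered years in the first column, one column per MS.
toyTB <- data.frame(
  time = 2001:2004,
  IT = c(55.1, 55.9, 56.4, 57.0),
  DE = c(68.7, 68.2, 68.0, 67.9),
  FR = c(67.2, 67.5, 68.1, 68.0)
)
# Years are expected to be sorted in increasing order.
stopifnot(!is.unsorted(toyTB$time))
toyTB
```

If a tibble is preferred, tibble::as_tibble(toyTB) converts the data frame without changing its shape.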

A general feature of results produced by most functions in the convergEU package is that a list with metainformation is returned. It has three components: \$res, \$msg, \$err. The first list component, \$res, is the actual result, if computed. The second component, \$msg is a message decorating the computed result, possibly a warning. The third component, \$err, is an error message or a list of errors when a result is not computed.
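The return convention can be sketched with a stand-in function. Here toy_mean() is a hypothetical helper written only for this illustration (it is not part of convergEU); it shows how a caller might inspect the three components before using a result.

```r
# Hypothetical stand-in mimicking the res/msg/err convention described above;
# it is not a convergEU function.
toy_mean <- function(x) {
  out <- list(res = NULL, msg = NULL, err = NULL)
  if (!is.numeric(x)) {
    out$err <- "Error: input is not numeric."
    return(out)
  }
  if (any(is.na(x))) out$msg <- "Warning: missing values were dropped."
  out$res <- mean(x, na.rm = TRUE)
  out
}

ans <- toy_mean(c(1, 2, NA, 3))
if (is.null(ans$err)) {
  print(ans$res)   # the computed result
  print(ans$msg)   # an optional message for the user
} else {
  print(ans$err)   # why the result was not computed
}
```

Checking the error component first avoids using a result that was never computed.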

The R packages exploited in this vignette are:

require(convergEU)
require(ggplot2)
require(dplyr)
require(tidyverse)
require(eurostat)
require(purrr)
require(tibble)
require(tidyr)
require(formattable) 
require(caTools)


Loading and preparing data

Two types of data sources are considered here because they are straightforward to work with in the convergEU package: data produced by Eurofound, statically available without an active internet connection, and Eurostat data, which can be downloaded on the fly while an internet connection is active.

Locally accessible datasets: Eurofound data

Some datasets are accessible from package convergEU using the R function data(), for example:

library(convergEU)
data("emp_20_64_MS",package = "convergEU")
head(emp_20_64_MS)

Eurofound datasets EWCS and EQLS are locally available within convergEU, see:

data(package = "convergEU")

The object dbEUF2018meta contains a description (metadata) of the data locally available within the convergEU package, without any download from the Eurofound website.

print(dbEUF2018meta, n=200,width=200)             

The raw local Eurofound database is accessed as follows:

require(convergEU)
data(dbEurofound)
head(dbEurofound)

where variable names are:

names(dbEurofound)

and time ranges over the interval:

c(min(dbEurofound$time), max(dbEurofound$time))

The database is not complete in such a time range for all considered countries.

NOTE: within the convergEU package, Eurofound data are statically stored. Please update this package to have the most recent version of Eurofound data.

A special role is played by the Eurofound dataset exploited during the development of this package: emp_20_64_MS. This is the tidy version derived from dataset emp_20_64 on the employment within age class 20 to 64:

help(emp_20_64_MS)

thus this dataset is always ready for analysis without further processing.

All other indicators are extracted from the Eurofound database in the first step of an analysis: this is the data preparation step. It amounts to choosing a time interval, an indicator and a set of countries (MS, Member States), for example:

convergEU_glb()$EU12

among those available:

names(convergEU_glb())[c(3:8)]

Please note that EU19 and Eurozone are synonyms:

convergEU_glb()$EU19

The general procedure to select an indicator is illustrated below for the "lifesatisf" indicator taken from the column "Code_in_database" within the object dbEUF2018meta that contains the meta information:

head(dbEUF2018meta)
myTB <- extract_indicator_EUF(
    indicator_code = "lifesatisf", #Code_in_database
    fromTime=2003,
    toTime=2016,
    gender= c("Total","Females","Males")[2],
    countries= convergEU_glb()$EU12$memberStates$codeMS
    )

myTB

which results in a dataset ready for further analyses in the list component named "\$res"; further components are "\$msg", which possibly carries messages for the user, and "\$err", which is a string containing an error message, if an error occurs.

Downloadable data: Eurostat repository

Eurostat data are downloaded on the fly from an active internet connection.

The heterogeneity in the structure of different indicators normalized into the database requires some attention. A list of covariates for each indicator is sometimes present besides age and gender, thus their values must be set to produce a tidy dataset time by countries.

First, raw data may be downloaded using the option rawDump=TRUE:

ddTB1 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= c(NA,"T","F","M")[2],#c("Total","Females","Males")
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=TRUE )
ddTB1
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB1,file=file.path(sourceFile1,"ddTB1.RData"))
load(file.path(sourceFile1,"ddTB1.RData"))
ddTB1

which is not a tidy dataset. Note that unit and isced11 are auxiliary variables specific to this indicator and that some filtering must be performed to obtain a tidy dataset years by countries. For this purpose, the argument rawDump=FALSE indicates that bulk data are filtered and reshaped, as shown below:

ddTB2 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= c(NA,"T","F","M")[1],#c("Total","Females","Males")
      ageInterv = NA,
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=FALSE,
      uniqueIdentif = 1)

convergEU_glb()$metaEUStat$selectorUser[1]
ddTB2
convergEU_glb()$metaEUStat$selectorUser[1]
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB2,file=file.path(sourceFile1,"ddTB2.RData"))
load(file.path(sourceFile1,"ddTB2.RData"))
ddTB2

where convergEU_glb()\$EU28\$memberStates\$codeMS is a vector of strings for the considered countries, convergEU_glb()\$metaEUStat\$selectorUser[1] contains the name of the indicator, and ageInterv may take a value when an age interval has to be specified for a given indicator. The result is a list with the components \$res, \$msg and \$err described in the Introduction.

It is therefore possible to call the same function several times, specifying the argument uniqueIdentif as one of the integers in the first column of \$msg\$Further_Conditioning\$available_seleTagLs, to obtain the same indicator under different scales and contexts. For example, the fifth conditioning context, for males in the age interval "Y15-64", is:

ddTB3 <- download_indicator_EUS(
      indicator_code= convergEU_glb()$metaEUStat$selectorUser[1],
      fromTime = 2005,
      toTime = 2015,
      gender= "M",
      ageInterv = "Y15-64",
      countries =  convergEU_glb()$EU28$memberStates$codeMS,
      rawDump=FALSE,
      uniqueIdentif = 5)
ddTB3
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB3,file=file.path(sourceFile1,"ddTB3.RData"))
load(file.path(sourceFile1,"ddTB3.RData"))
ddTB3


Data preparation: from data structure to imputation of missing values

The analysis of convergence is performed on clean and imputed data: a tidy dataset years by countries. The first step after downloading data is the description of the main features of such a dataset.

An illustrative example follows with the indicator "JQIintensity_i":

# print(dbEUF2018meta[11,],n=20,width=100)
t(dbEUF2018meta[11,])

First the raw dataset is downloaded:

ddTB4 <- extract_indicator_EUF(
    indicator_code = "JQIintensity_i", #Code_in_database
    fromTime= 1965,
    toTime=2016,
    gender= c("Total","Females","Males")[1],
    countries= convergEU_glb()$EU28$memberStates$codeMS
    )
print(ddTB4$res,n=35,width=250)
sourceFile1 <- system.file("extdata", package = "convergEU")
# save(ddTB4,file=file.path(sourceFile1,"ddTB4.RData"))
load(file.path(sourceFile1,"ddTB4.RData"))
print(ddTB4$res,n=35,width=250)

where error messages are not shown (\$err list component empty).

Inspection of the printed output reveals that missing values are present. More features, like dimension and variable names, may be investigated as usual with common R functions, but first it is convenient to assign a meaningful name to the downloaded data:

JQIinte <- ddTB4$res 
dim(JQIinte)

that is $5$ rows and $30$ columns, and variable names:

names(JQIinte)

A dataset used to compute convergence measures cannot contain qualitative variables, vectors of strings, or missing values. A time variable should also be present, and if its name is not "time", then the name must be passed as an argument in function calls to have proper data processing. The check_data() function may be called to check for the presence of unsuitable features that must be resolved before starting the analysis. The returned object states whether the dataset is ready for calculations and, if it is not, the error component states why checking failed.

For example, with the JQIinte data we have:

check_data(JQIinte,timeName="time")

missing values are present, thus imputation of missing values is required, using the impute_dataset function:

JQIinteImp <- impute_dataset(JQIinte, timeName = "time",
                          countries=convergEU_glb()$EU28$memberStates$codeMS,
                          tailMiss = c("cut", "constant")[2],
                          headMiss = c("cut", "constant")[2])$res 
print(JQIinteImp,n=35,width=250)

where \$res was added to the function call in order to extract just the list component of interest. The imputation selected for the first (tail) and last (head) years is "constant", thus the first non-missing value is propagated to the missing years; the alternative of cutting all years in which one or more missing values are present may be selected with the arguments:

tailMiss = "cut",headMiss = "cut"

In case more information is needed on years where missing values are located for a given country, say HR, the following simple code helps:

select(filter(JQIinte, is.na(HR)),time,HR) 

or, using pipes from the magrittr package:

JQIinte %>% 
  filter(is.na(HR)) %>%
  select(time,HR)

The check_data function is called again but on imputed data:

check_data(JQIinteImp)

where the suspected string variable is sex:

JQIinteFin <- dplyr::select(JQIinteImp,-sex)
check_data(JQIinteFin)

thus JQIinteFin is the final object to start the data analysis.
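The conditions that make a dataset ready for analysis (a time variable, only numeric columns, no missing values) can be sketched in base R. The helper toy_check() below is hypothetical and much simpler than check_data(); it only illustrates the kind of tests involved, on made-up data.

```r
# Hypothetical helper, not the package implementation of check_data().
toy_check <- function(tb, timeName = "time") {
  problems <- character(0)
  if (!(timeName %in% names(tb))) {
    problems <- c(problems, "time variable not found")
  }
  isNum <- vapply(tb, is.numeric, logical(1))
  if (!all(isNum)) {
    problems <- c(problems,
                  paste("non-numeric columns:",
                        paste(names(tb)[!isNum], collapse = ", ")))
  }
  if (any(is.na(tb))) problems <- c(problems, "missing values present")
  problems   # an empty vector means that no problem was found
}

toyTB <- data.frame(time = 2001:2003,
                    IT = c(1.2, NA, 1.5),
                    sex = c("T", "T", "T"))
toy_check(toyTB)
```

On this toy input both the string column and the missing value are reported, which mirrors the two fixes applied above: imputation and removal of the sex column.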

The tidyverse functions mutate, select and filter are the workhorses for more elaborate selections and inspections, with or without the use of the forward-pipe operator (%>%).

Note that before starting the analysis, the number of digits may be selected. For example, if rounding to integers is preferred, the above tibble must be changed by invoking the function round() with digits = 0:

JQIinteFin[, -1] <- round(select(JQIinteFin,- time), digits = 0)
JQIinteFin

Imputing missing values using a straight line

The basic imputation method is deterministic, as in the average of two interval endpoints separated by just one missing year. If several missing values are present in a row, a linear change of the indicator over time is assumed between the two observed time points flanking the chunk of missing values.

intervalTime <-  c(1999,2000,2001,2002,2003) 
intervalMeasure <- c( 66.5, NA,NA,NA,87.2) 
currentData <- tibble(time= intervalTime, veval= intervalMeasure) 
currentData
resImputed <- impute_dataset(currentData,
                           countries = "veval",
                           timeName = "time",
                           tailMiss = c("cut", "constant")[2],
                           headMiss = c("cut", "constant")[2]) 
resImputed$res  
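The same numbers can be worked out by hand as a cross-check. The sketch below uses stats::approx() from base R on the two observed endpoints; it assumes that the imputation is a plain linear interpolation, as described above, and is not the code used inside impute_dataset().

```r
intervalTime <- c(1999, 2000, 2001, 2002, 2003)
intervalMeasure <- c(66.5, NA, NA, NA, 87.2)
observed <- !is.na(intervalMeasure)
# Linear interpolation between the two observed endpoints.
byHand <- approx(x = intervalTime[observed],
                 y = intervalMeasure[observed],
                 xout = intervalTime)$y
byHand
```

Each missing year adds (87.2 - 66.5)/4 = 5.175 to the previous value, so the three interpolated points lie on the straight line joining the observed ones.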

In the figure below, grey points are imputed using observed values represented by solid blue points:

tmp <-  as.data.frame(currentData[ c(1,5),] )
tmp2 <- as.data.frame(resImputed$res[2:4,] )

myg <- ggplot(as.data.frame(resImputed$res),  mapping=aes(x=time,y=veval)) + 
  geom_point() + 
  geom_line(data=resImputed$res,col="red") + 
  geom_point(data=tmp,mapping=aes(x=time,y=veval), 
              size=4, 
              colour="blue")  + 
  geom_point(data= tmp2, 
             aes(x=time,y=veval),size=4,alpha=1/3,col="black") + 
  xlab("Time") + ylab("Measure / Index") +  
  ggtitle( "Blue points are observed values (grey ones are missing) \n") 

myg 

It must be emphasized that typical EU indicators span only a few years, thus $10\%$ of missing values within a country may already represent a context that requires substantive reasoning before interpreting results after imputation. The user may or may not go ahead with the analysis depending on the considered context.

Weighted average smoothing of a complete dataset

It may be of interest to assume that part of the variability observed in a country on a given index is not structural, i.e. not due to causal determinants but to transient fluctuations. Furthermore, the interest here is not directed towards prediction but towards smoothing the values observed in the whole considered time interval.

In such a case a smoothing procedure removes sudden large changes, yielding a time series that is less variable than the original.

Given that short time series (panel data) are considered here, a three-point weighted average is proposed. The smoother substitutes an original raw value $y_{m,i,t}$ of country $m$ indicator $i$ at time $t$ with the weighted average $$\check{y}_{m,i,t} = y_{m,i,t-1} ~ (1-w)/2 + w ~ y_{m,i,t} + y_{m,i,t+1} ~ (1-w)/2$$ where $0< w \leq 1$. The special case $w=1$ corresponds to no smoothing. In case of missing values an NA is returned. If the weight is outside the interval $(0,1]$ then an NA is returned. The first and last values are smoothed using weights $w$ and $1-w$.
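A minimal base R sketch of the three-point weighted average follows. The helper smooth3() is hypothetical (it is not smoo_dataset()), and the treatment of the two endpoints, weight $w$ on the value itself and $1-w$ on its only neighbour, is one reading of the rule stated above; returning NA for series shorter than two values is a further assumption of this sketch.

```r
# Hypothetical helper implementing the three-point weighted average;
# it is not the package function smoo_dataset().
smooth3 <- function(y, w) {
  n <- length(y)
  if (w <= 0 || w > 1 || n < 2) return(rep(NA_real_, n))
  out <- numeric(n)
  # Endpoints: weight w on the value itself, 1 - w on its only neighbour.
  out[1] <- w * y[1] + (1 - w) * y[2]
  out[n] <- w * y[n] + (1 - w) * y[n - 1]
  # Interior points: (1 - w)/2 on each neighbour, w on the central value.
  if (n > 2) {
    i <- 2:(n - 1)
    out[i] <- y[i - 1] * (1 - w) / 2 + w * y[i] + y[i + 1] * (1 - w) / 2
  }
  out
}

y <- c(10, 14, 9, 15, 11)   # made-up values
smooth3(y, w = 0.5)
smooth3(y, w = 1)           # identical to y: no smoothing
```

Because the arithmetic propagates NA, a missing value in the input yields NA for every smoothed value whose window contains it.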

After loading data, imputation takes place and finally smoothing is performed. Now, countries IT and DE are considered to illustrate the procedure. First check if missing values are present:

workTB <- dplyr::select(emp_20_64_MS, time, IT,DE)
check_data(workTB)

thus the check is passed, and the smoothing step follows after removing the time variable:

resSM <- smoo_dataset(select(workTB,-time), leadW = 0.149, timeTB= select(workTB,time))
resSM

and for a comparison:

compaTB <- select(bind_cols(workTB, select(resSM,-time)), time,IT,IT1,DE,DE1)
compaTB

A graphical output shows changes for "IT", with original index in blue and smoothed index in red:

# qplot() is deprecated in recent ggplot2 releases; ggplot() is used instead
ggplot(compaTB, aes(x=time,y=IT)) + 
  geom_point() +
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=IT1),colour="red") +
  geom_point(aes(x=time,y=IT1),colour="red",shape=8)

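and similarly for DE:

# qplot() is deprecated in recent ggplot2 releases; ggplot() is used instead
ggplot(compaTB, aes(x=time,y=DE)) + 
  geom_point() +
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=DE1),colour="red") +
  geom_point(aes(x=time,y=DE1),colour="red",shape=8)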

A weight equal to 1 leaves data unchanged:

resSM <- smoo_dataset(select(workTB,-time), leadW = 1, timeTB= select(workTB,time))
compaTB <- select(bind_cols(workTB, select(resSM,-time)), time,IT,IT1,DE,DE1)
# qplot() is deprecated in recent ggplot2 releases; ggplot() is used instead
ggplot(compaTB, aes(x=time,y=IT)) + 
  geom_point() +
  geom_line(colour="navyblue") +
  geom_line(aes(x=time,y=IT1),colour="red") +
  geom_point(aes(x=time,y=IT1),colour="red",shape=8)

A time window larger than $3$ could be considered, but careful thought is recommended about the economic and social changes that may happen in the EU during $5$ consecutive years.

Moving Average smoother

Several alternative smoothing algorithms are available in R. Classical moving average (MA) smoothers are described in the caTools package.

The emp_20_64_MS dataset is now chosen, with Italy as the country selected to illustrate operations.

data(emp_20_64_MS)
cuTB <- select(emp_20_64_MS,time)
cuTB <- mutate(cuTB,ITori =emp_20_64_MS$IT)

At the beginning and end of this series, values are averages calculated on a smaller and smaller number of observations (tails):

cuTB <- mutate(cuTB, IT_k_3= runmean(emp_20_64_MS$IT, k=3, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

cuTB <-  mutate(cuTB, IT_k_5= runmean(emp_20_64_MS$IT, k=5, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

cuTB <-  mutate(cuTB, IT_k_7= runmean(emp_20_64_MS$IT, k=7, 
        alg=c("C", "R", "fast", "exact")[4],
        endrule=c("mean", "NA", "trim", "keep", "constant", "func")[4],
        align = c("center", "left", "right")[1]))

where the options alg, endrule and align of the runmean function are documented in the caTools package.
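For readers who want to see the arithmetic without caTools, a centred moving average can be sketched with stats::filter() from base R. The values are made up, and this sketch leaves positions without a full window as NA, which differs from the end rules offered by runmean.

```r
x <- c(60.1, 61.3, 59.8, 62.4, 63.0, 62.1, 64.2)  # made-up values
k <- 3
# Centred moving average; positions without a full window are left as NA.
ma3 <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
ma3
```

With k = 3 only the first and last positions lack a full window; with larger k more positions at each end are affected, which matters for short series.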

The Figure below shows results for different degrees of smoothing: original (black), k=3 (red), k=5 (blue), k=7 (orange).

myG <- ggplot(cuTB,aes(x=time,y=ITori))+geom_line()+geom_point()+
       geom_line(aes(x=time,y=IT_k_3),colour="red")+
       geom_point(aes(x=time,y=IT_k_3),colour="red")+
       #
       geom_line(aes(x=time,y=IT_k_5),colour="blue")+
       geom_point(aes(x=time,y=IT_k_5),colour="blue")+
       #
       geom_line(aes(x=time,y=IT_k_7),colour="orange")+
       geom_point(aes(x=time,y=IT_k_7),colour="orange")

myG

It is typically the case that the time series is so short that at $k=7$ many values are smoothed using different numbers of observations (fewer at the start and end).

The above calculations are performed by a function in the convergEU package:

cuTB <-  emp_20_64_MS[,c("time","IT","DE")]
ma_dataset(cuTB, kappa=3, timeName= "time")

which is a bit less flexible but leads to a standard tidy dataset.


Absolute change

Absolute change is defined as: $$ \Delta y_{m,i,t} = y_{m,i,t} - y_{m,i,t-1} $$ for country $m$, indicator $i$ at time $t$.

The R function abso_change in this package calculates the above quantity, for example in the emp_20_64_MS dataset, which is tidy and without missing values:

data(emp_20_64_MS)
mySTB <- abso_change(emp_20_64_MS, 
                        time_0 = 2005, 
                        time_t = 2010,
                        all_within=TRUE,
                        timeName = "time")
names(mySTB$res)

thus the above equation results in:

mySTB$res$abso_change

If desired, fewer digits may be displayed, for example after rounding:

round(dplyr::select(mySTB$res$abso_change,AT:UK), 5)

The sum of absolute values $$ \sum_{t=t_0+1}^{T} | \Delta y_{m,i,t}| $$ over the selected window, where $t_0$ and $T$ denote the years passed as time_0 and time_t, is:

round(mySTB$res$sum_abs_change,4)

and this sum can be divided by the number of pairs of consecutive years, so that the result is an average per pair of years:

round(mySTB$res$average_abs_change,4)
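The three quantities of this section can be reproduced by hand for a single country. The sketch below uses made-up values and base R only; it illustrates the definitions and is not the code inside abso_change().

```r
# Made-up values of one indicator for one country over consecutive years.
y <- c(55.0, 56.5, 56.0, 58.0)
deltaY <- diff(y)                     # y_t - y_{t-1}
sumAbs <- sum(abs(deltaY))            # sum of absolute changes
avgAbs <- sumAbs / length(deltaY)     # average per pair of years
deltaY
sumAbs
avgAbs
```

Four years give three pairs of consecutive years, so the average divides the sum by three.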


Summaries and clusters of countries

An important summary is obtained as the unweighted average of country values. The cluster of considered countries may be specified and is also stored within the function generating global static objects and tables, called convergEU_glb(). The illustration of this function exploits the emp_20_64_MS dataset from the convergEU package.

First note that the Euro area (Eurozone) is made up of the following MS:

convergEU_glb()$Eurozone
convergEU_glb()$EU19

while labels representing the 28 MS are:

convergEU_glb()$EU28

The list of known MS labels is:

names(convergEU_glb())[3:8]

For example, the unweighted average in the emp_20_64_MS dataset is:

average_clust(emp_20_64_MS, 
              timeName = "time",
              cluster = "EU28")$res[,c(1,30)]

while for EU12 is:

average_clust(emp_20_64_MS,timeName = "time",cluster = "EU12")$res[,c(1,30)]
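The unweighted average is simply the row mean over the countries of the cluster. A minimal sketch with made-up values follows; the three-country cluster is hypothetical and the code is not the implementation of average_clust().

```r
toyTB <- data.frame(time = 2001:2003,
                    IT = c(55, 56, 57),
                    DE = c(68, 67, 69),
                    FR = c(66, 66, 68))
cluster <- c("IT", "DE", "FR")   # hypothetical cluster of three MS
# Every MS of the cluster must be in the data, otherwise the mean is wrong.
stopifnot(all(cluster %in% names(toyTB)))
toyTB$avg <- rowMeans(toyTB[, cluster])
toyTB[, c("time", "avg")]
```

The explicit membership check reflects the advice given in the Introduction: a cluster summary needs all the countries of the cluster in the input dataset.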

An unknown label, like "EUspirit", causes a computation error:

average_clust(emp_20_64_MS,timeName = "time",cluster = "EUspirit")

and similarly for a wrong time name:

average_clust(emp_20_64_MS,timeName = "TTime",cluster = "EU19")

Time series can also be plotted:

wwTB <- average_clust(emp_20_64_MS,timeName = "time",cluster = "EU28")$res[,c(1,30)]
mini_EU <- min(wwTB$EU28)
maxi_EU <- max(wwTB$EU28)

# qplot() is deprecated in recent ggplot2 releases; ggplot() is used instead
ggplot(wwTB, aes(x=time, y=EU28)) +
      geom_point() +
      geom_line(colour="navyblue") +
      ylim(mini_EU, maxi_EU) +
      ylab("emp_20_64")


The analysis of convergence

Several measures of convergence have been recently proposed by Eurofound (Eurofound, 2018). In this section, each measure is introduced and illustrated by a numerical example.

Beta-convergence

Let's assume that we have a tidy dataset (tibble) years by countries. The calculations are performed according to the following linear model: $$ \tau^{-1}(\ln(y_{m,i,t+\tau})-\ln(y_{m,i,t})) = \beta_0 + \beta_1 \ln(y_{m,i,t}) +\epsilon_{m,i,t} $$ where $m$ represents the member state of the EU (country), $i$ refers to an indicator of interest, $t$ is the reference time and $\tau \in \{1,2,\ldots\}$ is the length of the time window (typically $1$ or more years).

In the implementation of function beta_conv() the same reference time is maintained across different years, and the division on the left-hand side by the amount of time elapsed may be skipped if the argument useTau = FALSE is specified.

The output of beta_conv() is a list in which the transformed data, the point estimate of $\beta_1$, and a standard two-tailed test are reported (p-value and adjusted R squared). A one-tailed test of $H_0: \beta_1 \geq 0$ against $H_1: \beta_1 < 0$ might be of some interest, but it is not implemented.
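The linear model above can be fitted directly with lm() on transformed data. The sketch below uses made-up values for five hypothetical countries; the column names indic and deltaIndic are chosen for this illustration, and the code is not the implementation of beta_conv().

```r
# Made-up indicator values for five hypothetical countries.
y0   <- c(A = 50, B = 55, C = 60, D = 65, E = 70)   # reference time t
yTau <- c(A = 56, B = 59, C = 63, D = 67, E = 71)   # time t + tau
tau  <- 4
toyBC <- data.frame(indic = log(y0),
                    deltaIndic = (log(yTau) - log(y0)) / tau)
fit <- lm(deltaIndic ~ indic, data = toyBC)
coef(fit)                       # estimates of beta_0 and beta_1
summary(fit)$coefficients       # includes two-sided p-values
summary(fit)$adj.r.squared
```

In these toy data the countries starting from lower values grow faster, so the estimated slope is negative, which is the pattern read as beta-convergence.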

Below, an example on how to invoke the function:

require(ggplot2)
require(dplyr)
require(tibble)

empBC <- beta_conv(emp_20_64_MS, 
                 time_0 = 2002, 
                 time_t = 2006, 
                 all_within = FALSE, 
                 timeName = "time")
empBC

Note that all_within = FALSE is the default.

A plot of transformed data and the straight line may be useful:

# qplot() is deprecated in recent ggplot2 releases; ggplot() is used instead
ggplot(empBC$res$workTB, aes(x=indic, y=deltaIndic)) +
  geom_point() +
  xlab("log-Indicator") +
  ylab("Delta-log-indicator") +
  geom_abline(intercept = as.numeric(empBC$res$summary[1,2]),
              slope = as.numeric(empBC$res$summary[2,2]),
              colour = "red") +
  geom_text(aes(label=countries),
            hjust=0, vjust=0,colour="blue")

Note that labels are replicated as many times as the number of included subsequent years if all_within=TRUE was specified while invoking beta_conv().


Sigma-convergence

The key concept in sigma-convergence is variability with respect to the mean. Let $Y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t$, and $\overline{Y}_{A,i,t}$ the unweighted average over aggregation $A$, for example $A = EU28$; summaries of the dispersion of member states around this average, such as the standard deviation, are then considered.

For each year, the above summaries are calculated to quantify if a reduction in heterogeneity took place.
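A year-by-year computation of dispersion around the unweighted mean can be sketched in base R. The values are made up; the use of the population standard deviation (divisor equal to the number of countries) and of the coefficient of variation are assumptions of this sketch, not a statement about what sigma_conv() returns.

```r
toyTB <- data.frame(time = 2001:2003,
                    IT = c(55, 56, 57),
                    DE = c(68, 67, 66),
                    FR = c(66, 66, 65))
vals <- as.matrix(toyTB[, -1])
avg <- rowMeans(vals)
# Population standard deviation (divisor n); an assumption of this sketch.
stdDev <- sqrt(rowMeans((vals - avg)^2))
sigTB <- data.frame(time = toyTB$time, mean = avg,
                    stdDev = stdDev, CV = stdDev / avg)
sigTB
```

In these toy data the standard deviation shrinks from the first to the last year, the kind of reduction in heterogeneity that sigma-convergence looks for.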

We assume that all member states in the cluster contributing to the unweighted mean are contained in the dataset, for example:

mySTB <- sigma_conv(emp_20_64_MS,timeName="time")
mySTB

It is possible to select a time window and leave the name of the time variable at its default for this dataset:

sigma_conv(emp_20_64_MS,time_0 = 2002,time_t = 2004)

As a first step, the departure from the mean can be characterized:

res <- departure_mean(oriTB = emp_20_64_MS, sigmaTB = mySTB$res)
names(res$res)
res$res$departures

where $-1,0,1$ indicates values respectively below $-1$, within the interval $(-1,1)$ and above $+1$. Details on the contribution of each MS to the variance at a given time $t$ are evaluated by the square of the difference $(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2$ between the indicator $i$ of country $m$ at time $t$ and the unweighted average over member states, say EU28:

res$res$squaredContrib

It is also possible to decompose the numerator of the variance, called deviance, at each time in order to appreciate the percentage of contribution provided by each member state to the total deviance, $$100 \cdot \frac{(Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2}{ \sum_{m} (Y_{m,i,t} - \overline{Y}_{EU28,i,t})^2 }$$ for the indicator $i$ of country $m$ at time $t$.

res$res$devianceContrib

thus each row adds to $100$.
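The percentage decomposition of the deviance can be checked by hand on a small table. The values below are made up and the three-country cluster is hypothetical; the code only illustrates the formula and is not the implementation of departure_mean().

```r
toyTB <- data.frame(time = 2001:2002,
                    IT = c(55, 56), DE = c(68, 67), FR = c(66, 66))
vals <- as.matrix(toyTB[, -1])
sqContrib <- (vals - rowMeans(vals))^2            # squared departures
devContrib <- 100 * sqContrib / rowSums(sqContrib)
round(devContrib, 2)
rowSums(devContrib)                               # each year adds to 100
```

A country far from the average dominates its row, which makes it easy to spot the member states driving the dispersion in a given year.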

It is possible to produce a graphical output about the main features of country time series, as shown below:

myGG <- graph_departure(res$res$departures,
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 4,
                myfont_scale = 1.35,
                x_angle = 45,
                color_rect = c("-1"='red1', "0"='gray80',"1"='lightskyblue1'),
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.9
                )
myGG

Any selection of countries is feasible:

#myWW1<- warnings()
myGG <- graph_departure(res$res$departures[,1:10],
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 4,
                myfont_scale = 1.35,
                x_angle = 45,
                color_rect = c("-1"='red1', "0"='gray80',"1"='lightskyblue1'),
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.29
                )

myGG

Gamma-convergence

We now introduce gamma convergence by an index based on ranks.

Let $y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t=0,1,\ldots, T$, and $\{ \tilde{y}_{m,i,t}: m \in A \}$ the ranks for indicator $i$ over member states in the reference set $A$, for example $A = EU28$, at a given time $t$. The sum of ranks within member state $m$ is: $$ \tilde{y}^{(s)}_{m,i} = \sum_{t=0}^T \tilde{y}_{m,i,t} $$ thus the variance of the sum of ranks over the given interval $$ Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] $$ may be compared to the variance of ranks in the reference time $t=0$: $$ Var\left[ \{\tilde{y}_{m,i,0}: m \in A \} \right] $$

The Kendall index KI, with respect to cluster $A$ of member states for the indicator $i$ over a given time interval is: $$ KI(A,i,T) = \frac{Var\left[ \{\tilde{y}^{(s)}_{m,i}: m \in A \} \right] }{ (T+1)^2 ~~Var\left[ \{\tilde{y}_{m,i,0}: m \in A \}\right] } $$

The measure of gamma-convergence is obtained with the following function:
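The Kendall index can be computed step by step on a small table. The sketch below uses made-up values for four hypothetical member states observed at three times; it follows the formula above and is not the implementation of gamma_conv().

```r
# Rows are times t = 0, 1, 2; columns are hypothetical member states.
vals <- rbind(c(A = 50, B = 60, C = 70, D = 80),
              c(A = 58, B = 61, C = 69, D = 75),
              c(A = 66, B = 62, C = 68, D = 70))
ranks <- t(apply(vals, 1, rank))      # ranks within each time
sumRanks <- colSums(ranks)            # sum of ranks within each MS
nTimes <- nrow(vals)                  # T + 1
KI <- var(sumRanks) / (nTimes^2 * var(ranks[1, ]))
KI
```

Since the index is a ratio of two variances computed over the same countries, the choice between the sample and the population variance does not change its value. If the ranking never changed over time, the index would be exactly 1; changes in the ranking pull it below 1.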

gamma_conv(emp_20_64_MS,last=2016,ref=2002,timeName="time")

Delta-convergence

Let $y_{m,i,t}$ be the value of indicator $i$ for member state $m$ at time $t$, and $y^{(M)}_{i,t}$ the maximum value over member states in the reference set $A$, for example $A = EU28$: $$ y^{(M)}_{i,t} = \max(\{ y_{m,i,t}: m \in A\}) $$

The distance of a member state $m$ from the top performer at time $t$ is: $$ y^{(M)}_{i,t} - y_{m,i,t} $$ thus the overall distance at time $t$, called delta, is the sum of distances over the reference set $A$ of MS: $$ \delta_{i,t} = \sum_{m \in A} (y^{(M)}_{i,t} - y_{m,i,t}) $$ for the considered indicator $i$.
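The delta measure is a sum of distances from the yearly maximum, which base R can compute in a few lines. The values are made up and the maximum is taken as the top performer, as in the definition above; this is not the implementation of delta_conv().

```r
toyTB <- data.frame(time = 2001:2003,
                    IT = c(55, 56, 57),
                    DE = c(68, 67, 66),
                    FR = c(66, 66, 65))
vals <- as.matrix(toyTB[, -1])
topPerf <- apply(vals, 1, max)            # best value in each year
delta <- rowSums(topPerf - vals)          # sum of distances from the top
data.frame(time = toyTB$time, delta = delta)
```

In this toy example delta decreases over time, meaning that the countries are collectively getting closer to the top performer.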

Delta-convergence can be calculated as follows:

delta_conv(emp_20_64_MS,"time")


Automated production of scoreboard and fiches

The convergEU package offers the possibility of producing scoreboards and fiches in an automated way.


Scoreboards

The basis of a scoreboard is the raw value of an indicator (level, $y_{m,i,t}$) for MS $m$ at time $t$ for indicator $i$. Differences between subsequent years (changes) are important as well, namely $$ y_{m,i,t} - y_{m,i,t-1} $$ thus a function to calculate these values may be exploited.

Let's consider the dataset emp_20_64_MS. To calculate such quantities we do the following:

data(emp_20_64_MS)
resTB <- scoreb_yrs(emp_20_64_MS,timeName = "time")
resTB

where the result is a list of three components: the summary statistics, the numerical labels to indicate the interval of the partition a level belongs to, the interval of the partition a change belongs to.

Numerical labels are assigned as follows (see DRAFT JOINT EMPLOYMENT REPORT FROM THE COMMISSION AND THE COUNCIL):
* value $1$ if the original level or change is $y \leq m -1 \cdot s$;
* value $2$ if the original level or change is $m -1\cdot s < y \leq m - 0.5\cdot s$;
* value $3$ if the original level or change is $m - 0.5\cdot s< y \leq m +0.5\cdot s$;
* value $4$ if the original level or change is $m +0.5\cdot s< y \leq m + 1\cdot s$;
* value $5$ if the original level or change is $y > m +1\cdot s$.
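The partition into five labels maps naturally onto cut() in base R. The sketch below assumes that $m$ and $s$ are the mean and the sample standard deviation over member states in a given year, which the text above does not state explicitly; the values and country names are made up, and the code is not the implementation of scoreb_yrs().

```r
# Made-up levels of one indicator in one year for seven hypothetical MS.
y <- c(A = 52, B = 58, C = 61, D = 63, E = 66, F = 70, G = 78)
m <- mean(y)
s <- sd(y)   # assumption: s is the sample standard deviation
# right = TRUE gives intervals closed on the right, matching "<=" above.
lab <- cut(y,
           breaks = c(-Inf, m - s, m - 0.5 * s, m + 0.5 * s, m + s, Inf),
           labels = 1:5, right = TRUE)
data.frame(y = y, label = as.integer(as.character(lab)))
```

The same call applied to yearly changes instead of levels would give the labels for the change component of the scoreboard.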

We note that there is the possibility of representing the above summaries as coloured plots (TO DO) into scoreboards.

For the comparison of a country with the EU average, the following steps are recommended, from raw data:

# require(ggplot2)
# data(emp_20_64_MS)
selectedCountry <- "IT"
timeName <-  "time"
myx_angle <-  45

outSig <- sigma_conv(emp_20_64_MS, timeName = timeName,
           time_0=2002,time_t=2016)
miniY <- min(emp_20_64_MS[,- which(names(emp_20_64_MS) == timeName )])
maxiY <-  max(emp_20_64_MS[,- which(names(emp_20_64_MS) == timeName )])
estrattore<-  emp_20_64_MS[,timeName] >= 2002  &  emp_20_64_MS[,timeName] <= 2016
ttmp <- cbind(outSig$res, dplyr::select(emp_20_64_MS[estrattore,], -contains(timeName)))

myG2 <- 
  ggplot(ttmp) + ggtitle(
  paste("EU average (black, solid) and country",selectedCountry ," (red, dotted)") )+
  geom_line(aes(x=ttmp[,timeName], y =ttmp[,"mean"]),colour="black") +
  geom_point(aes(x=ttmp[,timeName],y =ttmp[,"mean"]),colour="black") +
#        geom_line()+geom_point()+
    ylim(c(miniY,maxiY)) + xlab("Year") +ylab("Indicator") +
  theme(legend.position = "none")+
  # add countries
  geom_line( aes(x=ttmp[,timeName], y = ttmp[,"IT"],colour="red"),linetype="dotted") + 
  geom_point( aes(x=ttmp[,timeName], y = ttmp[,"IT"],colour="red")) +
  ggplot2::scale_x_continuous(breaks = ttmp[,timeName],
                     labels = ttmp[,timeName]) +
   ggplot2::theme(
         axis.text.x=ggplot2::element_text(
         #size = ggplot2::rel(myfont_scale ),
         angle = myx_angle 
         #vjust = 1,
         #hjust=1
         ))

myG2

It is also possible to graphically show departures in terms of the above defined partition:

obe_lvl <- scoreb_yrs(emp_20_64_MS,timeName = timeName)$res$sco_level_num
# select subset of time
estrattore <- obe_lvl[,timeName] >= 2009 & obe_lvl[,timeName] <= 2016  
scobelvl <- obe_lvl[estrattore,]

my_MSstd <- ms_dynam( scobelvl,
                timeName = "time",
                displace = 0.25,
                displaceh = 0.45,
                dimeFontNum = 3,
                myfont_scale = 1.35,
                x_angle = 45,
                axis_name_y = "Countries",
                axis_name_x = "Time",
                alpha_color = 0.9
                )   

my_MSstd


Country fiches

The convergEU package provides a function that automatically prepares one or more country fiches. This function is able to create a directory along an existing path and to copy the rmarkdown file representing the template into it. The rmarkdown file is parameterized, so that by passing different parameters the compilation takes place with different data, say different indicators and countries.

It is very important to prepare complete data in a tibble (dataset) made of a time variable and as many other variables as the countries that enter into the calculation of the average at each time. Failing to satisfy this requirement causes the use of a wrong mean value at each year. Besides this, one key country is specified and some other countries of interest may be listed to better decorate graphs and compare performances.

Below, a call to the function go_ms_fi() illustrates the syntax:

go_ms_fi(
    workDF ='myTB',
    countryRef ='DE',
    otherCountries = "c('IT','UK','FR')",
    time_0 = 2002,
    time_t = 2016,
    tName = 'time',
    indiType = "highBest",
    aggregation= 'EU28',
    x_angle=  45,
    dataNow=  Sys.time(),
    author = 'A.Student',
    outFile = 'Germany-up2-2016', 
    outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish",
    indiName= 'emp_20_64_MS'
)

but it is very important to emphasize some constraints and unusual ways to pass parameters to such a function. In fact, note that the first argument is the working dataset which is passed not as an R object but as a string, the name of the dataset that must be available in the R workspace before invoking go_ms_fi.
The second argument countryRef is a string with the short name of a member country that will be shown in one-country plots. Less obviously, the argument indiType specifies whether the considered indicator is built so that a low value is good for a country (indiType = "lowBest") or a high value is good (indiType = "highBest").

Of particular importance is the argument outFile, a string indicating the name of the output file. Similarly, outDir is the path (unit and folders) in which the final compiled html will be stored. The syntax of the path depends on the operating system; for example outDir='F:/analysis/IT2018' indicates that in the usb disk called 'F', within the folder 'analysis', is located the folder 'IT2018' where R will write the country fiche. Note that a disk called 'F' must exist and the folder 'analysis' must also exist in that unit, while the folder 'IT2018' is created by the function if it does not already exist.

Within the above mentioned output directory, besides the compiled html, a further file is stored whose name is the one specified by outFile with the string '-workspace.RData' appended. It contains data and plots produced during the compilation of the country fiche, for subsequent use in other technical reports.


Indicator fiches

An auxiliary function go_indica_fi() is provided in the R package convergEU to produce an indicator fiche, where the output is an html file. For this purpose, an output directory must also be specified. Note that some arguments are passed as strings instead of objects, as described in the previous section.

An example of syntax to invoke the procedure is:

go_indica_fi(
    time_0 = 2005,
    time_t = 2010,
    timeName = 'time',
    workingDF = 'emp_20_64_MS' ,
    indicaT = 'emp_20_64',
    indiType = c('highBest','lowBest')[1],
    seleMeasure = 'all',
    seleAggre = 'EU28',
    x_angle =  45,
    data_res_download =  FALSE,
    auth = 'A.Student',
    dataNow =  '2019/05/16',
    outFile = "test_IT-emp_20_64_MS",
    outDir = "/media/fred/STORE/PRJ/2018-TENDER-EU/STEP-1/bitbucketed/tt-fish"
  )



References

Below the main references are listed:









federico-m-stefanini/convergEU documentation built on July 30, 2023, 3:22 a.m.