R/findClubs.R
In ConvergenceClubs: Finding Convergence Clubs

Documented in findClubs

#' Finds convergence clubs
#'
#' Find convergence clubs by means of Phillips and Sul clustering procedure.
#'
#' @param X dataframe containing data (preferably filtered data in order to remove business cycles).
#'  Data must not contain any NA or NaN values, otherwise the clustering procedure will be stopped with an error.
#'  In order for the clustering procedure to work, data must be arranged such that
#'  each row represents a unit and each column represents a time period (year, month, ...)
#' @param dataCols integer vector with the column indices of the data
#' @param unit_names integer scalar indicating, if present, the index of a column
#' with codes of the units
#' @param refCol integer scalar indicating the index of the column to use for ordering
#' data
#' @param time_trim a numeric value between 0 and 1, representing the portion of
#' time periods to trim when running \emph{log-t}  regression model.
#' Phillips and Sul (2007, 2009) suggest to discard the first third of the period.
#' @param HACmethod string indicating whether a Fixed Quadratic Spectral Bandwidth (\code{HACmethod="FQSB"}) or
#' an Adaptive Quadratic Spectral Bandwidth (\code{HACmethod="AQSB"}) should be used for the truncation
#' of the Quadratic Spectral kernel in estimating the \emph{log-t} regression model
#' with heteroskedasticity and autocorrelation consistent standard errors.
#' The default method is "FQSB".
#' @param cstar numeric scalar, indicating the threshold value of the sieve criterion (\eqn{c^*}{c*})
#' to include units in the detected core (primary) group (step 3 of Phillips and Sul (2007, 2009) clustering algorithm).
#' The default value is 0.
#'
#'
#' @param cstar_method a string specifying whether cstar should be mantained fixed
#' (\code{cstar_method="fixed"}) or increased iteratively until the whole club satisfies
#' the condition \eqn{tvalue>-1.65} (\code{cstar_method="incremental"}).
#' Default is \code{cstar_method="fixed"} (see Details).
#' @param cstar_increment a positive value specifying the increment to cstar,
#' only valid if \code{cstar_method="incremental"} (see Details); the default value is 0.1.
#' @param cstar_cap scalar indicating the maximum value up to which \code{cstar} can
#' be increased; the default value is 3.
#'
#'
#' @return Ad object of class \code{convergence.clubs}, containing a list of
#' Convergence Clubs, for each club a list is return with the
#' following objects: \code{id}, a vector containing the row indices
#' of the units in the club; \code{model}, a list containing information
#' about the model used to run the t-test on the units in the club;
#' \code{unit_names}, a vector containing the names of the units of the club (optional,
#' only included if parameter \code{unit_names} is given)
#'
#'
#' @details In order to investigate the presence of convergence clubs according
#' to the Phillips and Sul clustering procedure, the following steps are implemented:
#'    \enumerate{
#'        \item (Cross section last observation ordering):
#'              Sort units in descending order according to the last panel observation of the period;
#'        \item (Core group formation):
#'        Run the \emph{log-t}  regression for the first k units \eqn{(2 < k < N)} maximizing k
#'        under the condition that t-value is \eqn{> -1.65}. In other words, chose the core group size k* as follows:
#'
#'        \deqn{k^*= argmax_k \{t_k\} }{k* = argmax \, t(k)  }  subject to \deqn{ min\{t_k \} > -1.65}{min t(k) > 1.65}
#'
#'        If the condition \eqn{t_k >-1.65}{t(k) > -1.65} does not hold for \eqn{k = 2} (the first two units),
#'        drop the first unit and repeat the same procedure. If \eqn{t_k >-1.65}{t(k) > -1.65} does not hold
#'        for any units chosen, the whole panel diverges;
#'        \item (Sieve the data for club membership): After the core group  is detected,
#'        run the \emph{log-t} regression for the core group adding (one by one)
#'        each unit that does not belong to the latter. If \eqn{t_k }{t(k)} is greater than a critical value \eqn{c^*}{c*}
#'        add the new unit in the convergence club.
#'        All these units (those included in the core group \eqn{k^*}{k*} plus those added) form the first convergence club.
#'        Note that Phillips and Sul (2007) suggest to make sure \eqn{t_k>-1.65}{t(k) > -1.65}
#'        for the subconvergence group obtained. Otherwise, repeat the procedure by
#'        increasing the value of the \eqn{c^*}{c*} parameter until the condition
#'        \eqn{t_k>-1.65}{t(k) > -1.65} is satisfied for the subconvergence group;
#'
#'        \item (Recursion and stopping rule): If there are units for which the previous
#'        condition fails (\eqn{t_k <c^*}{t(k) < c*}), gather all these units in one
#'        group and run the \emph{log-t}  test to see if the condition \eqn{t_k >-1.65}{t(k) > -1.65}
#'        holds. If the condition is satisfied, conclude that there are two convergence clubs.
#'        Otherwise, step 1 to 3 should be repeated on the same group to determine
#'        whether there are other subgroups that constitute convergence clubs.
#'        If no k in step 2 satisfies the condition \eqn{t_k >-1.65}{t(k) > -1.65},
#'        the remaining units diverge.
#'    }
#'
#'
#'
#' Note that the clustering procedure may return groups with \eqn{t_k <-1.65}{t(k) < -1.65},
#' which are not really convergence clubs. In this case, following step 3 of
#' the clustering procedure there are two options:
#' (i) allow an iterative increase
#' of the cstar parameter until the subconvergence club satisfies the condition
#' \eqn{t_k >-1.65}{t(k) > -1.65}. In this case it should the argument \code{cstar_method}
#' should be set to \code{"incremental"} and a positive argument for the \code{cstar_increment}
#' argument should be chosen;
#' (ii) increase the value of the \code{cstar} in order to increase the discriminatory
#' power of the \emph{log-t} test in the formation of each club.
#'
#' Information about clubs, divergent units and the \eqn{c^*}{c*} used for each club can be
#' easily displayed by means of the \code{summary()} function, for which the package provides
#' a specific method for the \code{convergence.clubs} class.
#'
#'
#'
#'
#'
#' @references
#'
#' Andrews, D. W., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica: Journal of the Econometric Society, 817-858.
#'
#' Phillips, P. C.; Sul, D., 2007. Transition modeling and econometric convergence tests. Econometrica 75 (6), 1771-1855.
#'
#' Phillips, P. C.; Sul, D., 2009. Economic transition and growth. Journal of Applied Econometrics 24 (7), 1153-1185.
#'
#'
#'
#'
#' @seealso
#' \code{\link{mergeClubs}}, Merges a list of clubs created by \code{findClubs};
#'
#' \code{\link{mergeDivergent}}, Merges divergent units according to the algorithm proposed by von Lyncker and Thoennessen (2017)
#'
#'
#'
#' @examples
#' data("filteredGDP")
#'
#'
#' # Cluster Countries using GDP from year 1970 to year 2003
#' clubs <- findClubs(filteredGDP,  dataCols=2:35, unit_names = 1, refCol=35,
#'                    time_trim = 1/3, HACmethod = "FQSB",
#'                    cstar = 0,
#'                    cstar_method = 'incremental',
#'                    cstar_increment = 0.1)
#'
#'
#'
#' \dontrun{
#' # Cluster Countries using GDP from year 1970 to year 2003
#' clubs <- findClubs(filteredGDP,  dataCols=2:35, unit_names = 1, refCol=35,
#'                    time_trim = 1/3, HACmethod = "AQSB", cstar = 0)
#'
#'}
#'
#' @export




findClubs<- function(X, #data matrix or data.frame
                     dataCols, #vector with column indices of data,
                     unit_names = NULL, #column index of units, if present
                     refCol, #column index of year to be used as reference (lastT)
                     time_trim = 1/3, #portion of years to remove from computations (a value between >0 and <1)
                     HACmethod = c('FQSB','AQSB'),
                     cstar = 0, #c* value for the second step,
                     cstar_method = c('fixed', 'incremental'),
                     cstar_increment = 0.1,
                     cstar_cap = 3){


    ### Initialise variables ---------------------------------------------------
    HACmethod    <- match.arg(HACmethod)
    cstar_method <- match.arg(cstar_method)
    returnNames  <- switch(class(unit_names),
                          NULL = FALSE,
                          numeric = TRUE,
                          integer = TRUE,
                          stop("Not a valid value for argument 'unit_names'; it should be an integer"))

    N <- nrow(X)
    t <- length(dataCols)

    threshold <- -1.65

    #output
    clubs <- structure(list(),
                       class    = c("convergence.clubs", "list"),
                       data     = X,
                       dataCols = dataCols,
                       unit_names = unit_names,
                       refCol     = refCol,
                       time_trim  = time_trim,
                       HACmethod  = HACmethod,
                       cstar           = cstar,
                       cstar_method    = cstar_method,
                       cstar_increment = cstar_increment
    )
    ### Check inputs -----------------------------------------------------------

    #unit_names
    if(length(unit_names) > 1) stop("Argument 'unit_names' must be an integer-valued scalar")
    if( returnNames){ if(unit_names %% 1 != 0)  stop(" Argument 'unit_names' must be an integer-valued scalar")}

    #X
    if(!is.data.frame(X)) stop('X must be an object of class data.frame')
    X[,unit_names] <- as.character(X[,unit_names])

    #dataCols
    if(!all(apply(X[,dataCols],2,is.numeric)) ) stop('Some of the data columns are non-numeric')

    #length of time series
    if(t < 2) stop('At least two time periods are needed to run this procedure')

    #time_trim
    if( length(time_trim) > 1 | !is.numeric(time_trim) ) stop('time_trim must be a numeric scalar')
    if( time_trim > 1 | time_trim <= 0 ) stop('invalid value for time_trim; should be a value between 0 and 1')
    if( (t - round(t*time_trim)) < 2) stop('either the number of time periods is too small or the value of time_trim is too high')

    #refCol
    if( length(refCol) > 1 ) stop('refCol must be an integer-valued scalar')
    if( !is.numeric(refCol) ) stop('refCol must be an integer value indicating the column number of reference year')
    if( refCol %% 1 != 0 ) stop('refCol must be an integer-valued scalar')
    if( refCol > ncol(X) ) stop('Wrong refCol value; there is no such column')

    #cstar
    if(!is.numeric(cstar) | length(cstar) > 1) stop('cstar must be a numeric scalar!')
    if(cstar_method=='incremental' &
       (!is.numeric(cstar_increment) | length(cstar_increment) > 1 | cstar_increment <0) ){
        stop('cstar_increment must be a positive numeric scalar!')
    }
    if(cstar_method=='incremental' &
       (!is.numeric(cstar_cap) | length(cstar_cap) > 1 | cstar_cap < 0) ){
        stop('cstar_cap must be a positive numeric scalar!')
    }



    ### Other preliminary operations -------------------------------------------

    #add id column to dataset
    X$id <- 1:N
    #Sort data by clustering variable (decreasing)
    dati <- X[order(X[,refCol],decreasing = TRUE),]

    ### Find clubs -------------------------------------------------------------
    #Cluster procedure
    l <- 1
    while(TRUE){
        if (nrow(dati) == 1){
            clubs$divergent$id <- dati$id
            break #break while loop if out of units
        }else if(nrow(dati) == 0){
            break
        }

        #Test all units
        H_all   <- computeH(dati[,dataCols])
        mod_all <- estimateMod(H_all, time_trim, HACmethod=HACmethod)
        t_all   <- mod_all['tvalue']
        # if tvalue > -1.65, they all form one club,
        #otherwise go one with clustering
        if (t_all > threshold) {
            clubs[[paste('club',l,sep = '')]] <- list(
                # units = as.character(dati[,IDvar]),
                id    =  dati$id,
                model = mod_all,
                cstar = cstar)
            break
        }

        #find core group (returns the row indices of units in core Group)
        coreGroup <- coreG(X=dati, dataCols, time_trim, threshold,
                           HACmethod = HACmethod, type="max")
        #if no more core groups are found, add divergent to output and return
        if (identical(coreGroup, FALSE) ){
            # nl <- length(clubs)
            clubs$divergent$id <- dati$id
            break
        }

        #add units to core group
        convClub <- club(X = dati, dataCols, core = coreGroup, time_trim,
                         HACmethod = HACmethod,
                         cstar = cstar,
                         cstar_method = cstar_method,
                         cstar_increment = cstar_increment,
                         cstar_cap = cstar_cap)
        # newcstar <- clubConv$model$cstar
        # xidclub <- which(X[,IDvar] %in% as.character(clubConv$units))
        clubs[[paste('club',l,sep = '')]] <- list( id = convClub$id,
                                                   model = convClub$model,
                                                   cstar = convClub$cstar
        )
        dati <- dati[-convClub$rows,]#remove the club found from the dataset
        l <- l + 1
    }#end of while, end of clustering

    ### Return -----------------------------------------------------------------
    #if returnNames, then add region codes to output
    if(returnNames){
        for(club in names(clubs) ){
            clubs[[club]]$unit_names <- X[clubs[[club]]$id, unit_names]
        }
        return(clubs)
    }else return(clubs)
}