R/fitgroup_b2.R

Defines functions fitgroup.b2

Documented in fitgroup.b2

#' Estimation of the Beta 2 distribution from group data
#'
#' The function \code{fitgroup.b2} implements the estimation of the Beta 2 distribution from group
#' data in form of income shares using the equally weighted minimum distance (EWMD) and the 
#' optimally weighted minimum distance (OMD) estimators.
#'
#' @inheritParams fitgroup.gb2
#' @return the function \code{fitgroup.b2} returns the following objects:
#'   \itemize{
#'     \item \code{ewmd.estimation} Matrix containing the parameters of the Beta 2 distribution estimated
#'        by EWMD and, if \code{se.ewmd = TRUE}, their standard errors.
#'     \item \code{ewmd.rss} Residual sum of squares of the EWMD estimation.
#'     \item \code{omd.estimation} Matrix containing the parameters of the Beta 2 distribution estimated
#'        by OMD and, if \code{se.omd = TRUE}, their standard errors.
#'     \item \code{omd.rss} Weighted residual sum of squares of the OMD estimation.
#'     \item \code{gini.estimation} Vector with the survey Gini index and the estimated Gini
#'      indices using EWMD and OMD estimates whenever possible.
#'   }
#' @details The Generalised Beta of the Second Kind (GB2) is a general class of distributions that is
#' acknowledged to provide an accurate fit to income data (McDonald 1984; McDonald and Mantrala,1995).
#' The Beta 2 distribution is a particular case of this model with \eqn{a = 1}, defined in terms of
#' the cumulative distribution function as follows:
#' \deqn{F(x; b, p, q) = B\bigg(\frac{x/b}{1+x/b}; p; q\bigg); x > 0,}
#'
#' where \eqn{b} is the scale parameter and \eqn{p, q} are the shape parameters that define the
#' heaviness of the tail and the skewness of the distribution.
#'
#' The function \code{fitgroup.da} estimates the parameters of the Beta 2 distribution using grouped data in form of
#' income shares. These data must have been generated by setting the proportion of observations in each
#' group before sampling, so that the population proportions are fixed, whereas income shares are random
#' variables. Examples of this type of data can be found in the largest datasets of grouped data,
#' including The World Income Inequality Database (UNU-WIDER, 2017), PovcalNet (World Bank, 2018) or the World Wealth
#' and Income Database (Alvaredo et al., 2018).
#'
#' For EWMD, numerical optimisation is achieved using the Levenberg-Marquardt Algorithm via \code{\link[minpack.lm]{nlsLM}}
#' Conventionally, moment estimates of a restricted model are taken as initial values. A potential
#' limitation of this method is that, as the dimensionality of the parameter
#' space increases, it is more difficult to achieve global convergence. Although it seems
#' quite intuitive that the moment estimates of the restricted model might be a good starting
#' point, the optimization could converge to
#' a local minimum, which might lead to inaccurate estimates of the parameters.
#'
#' To provide different non-arbitrary combinations of starting values, we propose to
#' define a sequence of numbers (provided by \code{grid}). For each value in this
#' sequence, the moment estimate of one of the parameters is obtained using the survey Gini index,
#' assuming that the other one is equal to the grid value. Using this procedure,
#' we end up with as many combinations of initial values as values in the grid,
#' which are used to obtain different sets of estimates, keeping the one with the smallest
#' residual sum of squares. Although we cannot ensure that our estimates belong to
#' the global minimum, this procedure covers a larger proportion of the parameter
#' space than just using the moment estimates of a
#' particular sub-model. See Jorda et al. (2018) for details.
#'
#' This method, however, does not provide
#' an estimate for the scale parameter because the Lorenz curve is independent to scale. The scale
#' parameter is estimated by equating the sample mean, specified by \code{pc.inc}, to the population
#' mean of the Beta 2 distribution. Because EWMD does not use the optimal
#' covariance matrix of the moment conditions, the standard errors of the parameters
#' are obtained by Monte Carlo simulation. Please be aware that the estimation of the standard errors
#' might take a long time, especially if the sample size is large.
#'
#' \code{fitgroup.b2} also implements a two-stage OMD estimator. In the first stage, EWMD estimates
#' are obtained as described above, which are used to compute a first stage estimator
#' of the weighting matrix. The weighting matrix is used in the second stage to obtain optimally
#' weighted estimates of the parameters. The numerical optimisation is performed using
#' \code{\link{optim}} with the BFGS method. If \code{optim} reports an error, the L-BFGS method
#' is used. EWMD estimates are used as initial values for the optimisation algorithm. The OMD estimation
#'  incorporates the optimal weight matrix, thus making possible to derive the asymptotic standard
#'  errors of the parameters using results from Beach and Davison(1983) and Hajargasht and
#'  Griffiths (2016). As in the EWMD estimation, the scale parameter is obtained by matching the
#'  population mean of the Beta 2 distribution to the sample mean. Hence, the standard error of the scale
#'  parameter is estimated by Monte Carlo simulation.
#'
#' The Gini index of the Beta 2 distribution is computed using the function
#' \code{simgini.b2} which makes use of \code{gini.b2}.
#' If this function reports NaN, the Gini index is estimated by Monte
#' Carlo simulation of 10^6 samples of size N = 10^6.
#'
#'
#' @references
#'
#'  Alvaredo, F., A. Atkinson, T. Piketty, E. Saez, and G. Zucman. The World Wealth and Income Database.
#'
#' Beach, C.M. and R. Davidson (1983): Distribution-free statistical inference with
#' Lorenz curves and income shares, \emph{The Review of Economic Studies}, 50, 723 - 735.
#'
#'  Hajargasht, G. and W.E. Griffiths (2016): Inference for Lorenz Curves, Tech. Rep.,
#'  The University of Melbourne.
#'
#'  Jorda, V., Sarabia, J.M., & Jäntti, M. (2018). Estimation of income inequality from grouped data.
#'  arXiv preprint arXiv:1808.09831.
#'
#'  McDonald, J.B. (1984): Some Generalized Functions for the Size Distribution of Income,
#'  \emph{Econometrica}, 52, 647 - 665.
#'
#'  McDonald, J.B. and A. Mantrala (1995): The distribution of personal income: revisited,
#'  \emph{Journal of Applied Econometrics}, 10, 201 - 204.
#'
#'  UNU-WIDER (2018). World Income Inequality Database (WIID3.4).
#'
#'  World Bank (2018). PovcalNet Data Base. Washington, DC: World Bank. \url{http://iresearch.worldbank.org/PovcalNet/home.aspx}.
#'
#' @export
#' @examples
#' fitgroup.b2(y = c(9, 13, 17, 22, 39), gini.e = 0.29)
#'
#' @importFrom minpack.lm nlsLM nls.lm.control
#' @importFrom stats resid coef uniroot optim
#' @export
#'
fitgroup.b2 <- function(y, x = rep(1 / length(y), length(y)), gini.e, pc.inc = NULL, se.omd = FALSE, se.ewmd = FALSE, se.scale = FALSE, N = NULL, nrep = 10^3, grid = 1:20, rescale = 1000, gini = FALSE) {
  if(length(y) != length(x)) {
    stop("x and y must be of the same length")
  }
  if(length(y) < 4) {
    stop("At least four points of the Lorenz curve are required to perform the estimation")
  }
  if(!is.numeric(gini.e)) {
    stop("Gini index is not numeric")
  }
  if(gini.e < 0 | gini.e > 1) {
    stop("Gini index must be between 0 and 1")
  }
  if(!is.numeric(pc.inc) & !is.null(pc.inc)) {
    stop("Per capita income is not numeric")
  }
  if(is.numeric(pc.inc)) {
    if(pc.inc <= 0){
      stop("Per capita GDP should be positive")
    }
  }
  if(!is.numeric(rescale)) {
    stop("Rescale is not numeric")
  }
  if(rescale <= 0) {
    stop("Rescale must be positive")
  }
  if(sum(grid <= 0) != 0) {
    stop("Grid must be a secuence of positive numbers")
  }
  share <- as.vector(y[!is.na(y)])
  share <- as.numeric(share)/sum(share)
  share <- cumsum(share)[-length(share)]
  cprob <- as.vector(x[!is.na(x)])
  cprob <- as.numeric(cprob)/sum(cprob)
  cprob <- cumsum(cprob)[-length(cprob)]

  par.p  <- rep(0, length(grid))
  par.q  <- rep(0, length(grid))
  rss <- rep(1000, length(grid))

  for (j in (1:length(grid))) {
    ginit <- function(p) {
      gini.b2(c(p, grid[j])) - gini.e
    }
    free.p <- try(suppressWarnings(uniroot(ginit, c(0.5, 1000))), silent = TRUE)
    if('try-error'%in%class(free.p)) {
      ginit <- function(p) {
        gini.b2(c(grid[j], p)) - gini.e
      }
      free.p <- try(suppressWarnings(uniroot(ginit, c(0.5, 1000))), silent = TRUE)
      if(!'try-error'%in%class(free.p)) {
        regress <- try(suppressWarnings(nlsLM(share ~ (lc.gb2(c(1, 1, P, Q), cprob)), algorithm ="port",
          start = list(P=grid[j], Q=free.p$root), lower=c(0, 0), control=nls.lm.control(maxiter=1000))),
          silent=TRUE)
        if(!'try-error'%in%class(regress)) {
          par.p[j] <- coef(regress)[1]
          par.q[j] <- coef(regress)[2]
          rss[j] <- sum(resid(regress)^2)
        }
      }
    }
    else {
      regress <- try(suppressWarnings(nlsLM(share ~ (lc.gb2(c(1, 1, P, Q), cprob)), algorithm ="port",
        start = list(P=free.p$root, Q=grid[j]), lower=c(0, 0), control=nls.lm.control(maxiter=1000))),
        silent = TRUE)
      if(!'try-error'%in%class(regress)) {
        par.p[j] <- coef(regress)[1]
        par.q[j] <- coef(regress)[2]
        rss[j] <- sum(resid(regress)^2)
      }
    }
  }
  rg <- which.min(rss)
  nls.rss <- min(rss)
  nls.p <- par.p[rg]
  nls.q <- par.q[rg]
  if(nls.q <= 1) {
    temp.b <- NA
    print("Unable to compute the scale parameter and the OMD estimation. Condition for the existence of the first moment violated: q <= 1")
  }
  if(is.null(pc.inc)) {
    temp.b <- NA
    print("Unable to compute the scale parameter and the OMD estimation. Per capita GDP not provided")
  }
  if(!is.null(pc.inc) & (nls.q > 1)) {
    incpc <- pc.inc/rescale
    temp.b <- scale.gb2(c(1, nls.p, nls.q), incpc)
  }
  if(nls.q <= 2){
    print("Unable to compute OMD estimates of the parameters. Condition for the existence of the second moment violated: q <= 2")
  }
  if(nls.q <= 2 | is.na(temp.b)) {
    gmm.coef <- matrix(NA, 1, 3)
    gmm.se <- matrix(NA, 1, 3)
    gmm.rss <- NA
  }
  else{
    regress <- try(opt.gmm.b2(cprob, share, init.est = c(temp.b, nls.p, nls.q), cons.est = c(temp.b, nls.p, nls.q)))
    if('try-error'%in%class(regress$opt1)) {
      print("Unable to compute OMD estimates of the parameters. The weight martrix cannot be inverted. Try changing the value of rescale")
      gmm.coef <- matrix(NA, 1, 3)
      gmm.se <- matrix(NA, 1, 3)
      gmm.rss <- NA
    }
    else {
      gmm.rss <- regress$opt1$value
      gmm.coef <- regress$opt1$par
      gmm.coef[1] <- scale.gb2(c(1, gmm.coef[2], gmm.coef[3]), pc.inc)
      gmm.se <- matrix(NA, 1, 3)
    }
  }
  nls.estimation <- matrix(NA, 2, 3)
  nls.coef <- matrix(c(temp.b * rescale, nls.p, nls.q), 1, 3)
  nls.se <- matrix(NA, 1, 3)
  if (se.ewmd == TRUE) {
    if (is.null(N)) {
      print("Unable to compute the standard errors. Please provide the sample size")
    }
    else {
      se.calc <- simsd.b2(x = x, theta = nls.coef, N = N, nrep = nrep, se.scale = se.scale, factor.r = rescale)
      nls.se <- se.calc$nls.se
      if(!is.na(temp.b)) {
        if(se.omd == TRUE){
          gmm.se <- gmmse.b2(gmm.coef, cprob, N)
        }
        if(se.scale == TRUE){
          gmm.se[1] <- se.calc$sd.scale
        }
      }
    }
  }
  if (se.ewmd == FALSE & se.omd == TRUE) {
    if (is.null(N)) {
      print("Unable to compute the standard errors. Please provide the sample size")
    }
    else {
      if(!is.na(temp.b)) {
        gmm.se <- gmmse.b2(gmm.coef, cprob, N)
        if(se.scale == TRUE){
          gmm.se[1] <- simsd.b2(x = x, theta = nls.coef, N = N, nrep = nrep, se.scale = se.scale, factor.r = rescale)$sd.scale
        }
      }
    }
  }
  nls.estimation[1, ] <- nls.coef
  nls.estimation[2, ] <- nls.se
  colnames(nls.estimation) <- c("b", "p", "q")
  row.names(nls.estimation) <- c("Coef.", "se")

  gmm.estimation <- matrix(NA, 2, 3)
  gmm.estimation[1, ] <- gmm.coef
  gmm.estimation[2, ] <- gmm.se
  colnames(gmm.estimation) <- c("b", "p", "q")
  row.names(gmm.estimation) <- c("Coef.", "se")

  grouped.data <- rbind(share, cprob)
  row.names(grouped.data) <- c("Income", "Population")

  if (gini == TRUE) {
    if (!is.na(gmm.rss)) {
      gmm.gini <- simgini.b2(gmm.coef)
    }
    else {gmm.gini <- NA}
    nls.gini <- simgini.b2(nls.coef)
    gini.estimation <- matrix(NA, 1, 3)
    colnames(gini.estimation) <- c("Survey", "EWMD estimate", "OMD estimate")
    gini.estimation[1] <- gini.e
    gini.estimation[2] <- nls.gini
    gini.estimation[3] <- gmm.gini
    out2 <- list(grouped.data = grouped.data, distribution = "Beta 2", ewmd.estimation = nls.estimation, ewmd.rss = nls.rss, omd.estimation = gmm.estimation, omd.rss = gmm.rss,
      gini.estimation = gini.estimation)
  }
  else {
    out2 <- list(grouped.data = grouped.data, distribution = "Beta 2", ewmd.estimation = nls.estimation, ewmd.rss = nls.rss, omd.estimation = gmm.estimation, omd.rss = gmm.rss)
  }
  return(out2)
}

Try the GB2group package in your browser

Any scripts or data that you put into this service are public.

GB2group documentation built on Jan. 26, 2021, 5:06 p.m.