R/fitgroup_da.R

Defines functions fitgroup.da

Documented in fitgroup.da

#' Estimation of the Dagum distribution from group data
#'
#' The function \code{fitgroup.da} implements the estimation of the Dagum distribution from group
#' data in form of income shares using the equally weighted minimum distance (EWMD) and the 
#' optimally weighted minimum distance (OMD) estimators.
#'
#' @inheritParams fitgroup.gb2
#' @return the function \code{fitgroup.da} returns the following objects:
#'   \itemize{
#'     \item \code{ewmd.estimation} Matrix containing the parameters of the Dagum distribution estimated
#'        by EWMD and, if \code{se.ewmd = TRUE}, their standard errors.
#'     \item \code{ewmd.rss} Residual sum of squares of the EWMD estimation.
#'     \item \code{omd.estimation} Matrix containing the parameters of the Dagum distribution estimated
#'        by OMD and, if \code{se.omd = TRUE}, their standard errors.
#'     \item \code{omd.rss} Weighted residual sum of squares of the OMD estimation.
#'     \item \code{gini.estimation} Vector with the survey Gini index and the estimated Gini
#'      indices using EWMD and OMD estimates whenever possible.
#'   }
#' @details The Generalised Beta of the Second Kind (GB2) is a general class of distributions that is
#' acknowledged to provide an accurate fit to income data (McDonald 1984; McDonald and Mantrala,1995).
#' The Dagum distribution is a particular case of this model with \eqn{q = 1}, defined in terms of
#' the cumulative distribution function as follows:
#' \deqn{F(x; a, b, p) = \bigg(1+\bigg(\frac{x}{b}\bigg)^{-a}\bigg)^{-p} }
#'
#' where \eqn{b} is the scale parameter and \eqn{a, p} are the shape parameters that define the
#' heaviness of the tail and the skewness of the distribution.
#'
#' The function \code{fitgroup.da} estimates the parameters of the Dagum distribution using grouped data in form of
#' income shares. These data must have been generated by setting the proportion of observations in each
#' group before sampling, so that the population proportions are fixed, whereas income shares are random
#' variables. Examples of this type of data can be found in the largest datasets of grouped data,
#' includingThe World Income Inequality Database (UNU-WIDER, 2017), PovcalNet (World Bank, 2018) or the World Wealth
#' and Income Database (Alvaredo et al., 2018).
#'
#' For EWMD, numerical optimisation is achieved using the Levenberg-Marquardt Algorithm via \code{\link[minpack.lm]{nlsLM}}
#' Conventionally, moment estimates of a restricted model are taken as initial values. A potential
#' limitation of this method is that, as the dimensionality of the parameter
#' space increases, it is more difficult to achieve global convergence. Although it seems
#' quite intuitive that the moment estimates of the restricted model might be a good starting
#' point, the optimization could converge to
#' a local minimum, which might lead to inaccurate estimates of the parameters.
#'
#' To provide different non-arbitrary combinations of starting values, we propose to
#' define a sequence of numbers (provided by \code{grid}). For each value in this
#' sequence, the moment estimate of one of the parameters is obtained using the survey Gini index,
#' assuming that the other one is equal to the grid value. Using this procedure,
#' we end up with as many combinations of initial values as values in the grid,
#' which are used to obtain different sets of estimates, keeping the one with the smallest
#' residual sum of squares. Although we cannot ensure that our estimates belong to
#' the global minimum, this procedure covers a larger proportion of the parameter
#' space than just using the moment estimates of a
#' particular sub-model. See Jorda et al. (2018) for details.
#'
#' This method, however, does not provide
#' an estimate for the scale parameter because the Lorenz curve is independent to scale. The scale
#' parameter is estimated by equating the sample mean, specified by \code{pc.inc}, to the population
#' mean of the Dagum distribution. Because EWMD does not use the optimal
#' covariance matrix of the moment conditions, the standard errors of the parameters
#' are obtained by Monte Carlo simulation. Please be aware that the estimation of the standard errors
#' might take a long time, especially if the sample size is large.
#'
#' \code{fitgroup.da} also implements a two-stage OMD estimator. In the first stage, EWMD estimates
#' are obtained as described above, which are used to compute a first stage estimator
#' of the weighting matrix. The weighting matrix is used in the second stage to obtain optimally
#' weighted estimates of the parameters. The numerical optimisation is performed using
#' \code{\link{optim}} with the BFGS method. If \code{optim} reports an error, the L-BFGS method
#' is used. EWMD estimates are used as initial values for the optimisation algorithm. The OMD estimation
#'  incorporates the optimal weight matrix, thus making possible to derive the asymptotic standard
#'  errors of the parameters using results from Beach and Davison(1983) and Hajargasht and
#'  Griffiths (2016). As in the EWMD estimation, the scale parameter is obtained by matching the
#'  population mean of the Dagum distribution to the sample mean. Hence, the standard error of the scale
#'  parameter is estimated by Monte Carlo simulation.
#'
#' The Gini index of the Dagum distribution is computed using the function
#' \code{simgini.da} which makes use of \code{gini.d}.
#' If this function reports NaN, the Gini index is estimated by Monte
#' Carlo simulation of 10^6 samples of size N = 10^6.
#'
#'
#' @references
#'  Alvaredo, F., A. Atkinson, T. Piketty, E. Saez, and G. Zucman. The World Wealth and Income Database.
#'
#' Beach, C.M. and R. Davidson (1983): Distribution-free statistical inference with
#' Lorenz curves and income shares, \emph{The Review of Economic Studies}, 50, 723 - 735.
#'
#'  Hajargasht, G. and W.E. Griffiths (2016): Inference for Lorenz Curves, Tech. Rep.,
#'  The University of Melbourne.
#'
#'  Jorda, V., Sarabia, J.M., & Jäntti, M. (2018). Estimation of income inequality from grouped data.
#'  arXiv preprint arXiv:1808.09831.
#'
#'  McDonald, J.B. (1984): Some Generalized Functions for the Size Distribution of Income,
#'  \emph{Econometrica}, 52, 647 - 665.
#'
#'  McDonald, J.B. and A. Mantrala (1995): The distribution of personal income: revisited,
#'  \emph{Journal of Applied Econometrics}, 10, 201 - 204.
#'
#'  UNU-WIDER (2018). World Income Inequality Database (WIID3.4).
#'
#'  World Bank (2018). PovcalNet Data Base. Washington, DC: World Bank. \url{http://iresearch.worldbank.org/PovcalNet/home.aspx}.
#'
#' @export
#' @examples
#' fitgroup.da(y = c(9, 13, 17, 22, 39), gini.e = 0.29)
#'
#' @importFrom minpack.lm nlsLM nls.lm.control
#' @importFrom stats resid coef uniroot optim
#' @export
#'
fitgroup.da <- function(y, x = rep(1 / length(y), length(y)), gini.e, pc.inc = NULL, se.omd = FALSE, se.ewmd = FALSE, se.scale = FALSE, N = NULL, nrep = 10^3, grid = 1:20, rescale = 1000, gini = FALSE) {
  if(length(y) != length(x)) {
    stop("x and y must be of the same length")
  }
  if(length(y) < 4) {
    stop("At least four points of the Lorenz curve are required to perform the estimation")
  }
  if(!is.numeric(gini.e)) {
    stop("Gini index is not numeric")
  }
  if(gini.e < 0 | gini.e > 1) {
    stop("Gini index must be between 0 and 1")
  }
  if(!is.numeric(pc.inc) & !is.null(pc.inc)) {
    stop("Per capita income is not numeric")
  }
  if(is.numeric(pc.inc)) {
    if(pc.inc <= 0){
      stop("Per capita GDP should be positive")
    }
  }
  if(!is.numeric(rescale)) {
    stop("Rescale is not numeric")
  }
  if(rescale <= 0) {
    stop("Rescale must be positive")
  }
  if(sum(grid <= 0) != 0) {
    stop("Grid must be a secuence of positive numbers")
  }
  share <- as.vector(y[!is.na(y)])
  share <- as.numeric(share)/sum(share)
  share <- cumsum(share)[-length(share)]
  cprob <- as.vector(x[!is.na(x)])
  cprob <- as.numeric(cprob)/sum(cprob)
  cprob <- cumsum(cprob)[-length(cprob)]

  par.a  <- rep(0, length(grid))
  par.p  <- rep(0, length(grid))
  rss <- rep(1000, length(grid))

  for (j in (1:length(grid))) {
    ginit <- function(a) {
      gini.d(c(a, grid[j])) - gini.e
    }
    free.p <- try(uniroot(ginit, c(0.5, 1000)), silent = TRUE)
    if('try-error'%in%class(free.p)) {
      ginit <- function(a) {
        gini.d(c(grid[j], a)) - gini.e
      }
      free.p <- try(uniroot(ginit, c(0.5, 1000)), silent = TRUE)
      if(!'try-error'%in%class(free.p)) {
        regress <- try(suppressWarnings(nlsLM(share ~ (lc.gb2(c(A, 1, P, 1), cprob)), algorithm ="port",
          start = list(A = grid[j], P = free.p$root),lower=c(0, 0), control = nls.lm.control(maxiter = 1000))),
          silent = TRUE)
        if(!'try-error'%in%class(regress)) {
          par.a[j] <- coef(regress)[1]
          par.p[j] <- coef(regress)[2]
          rss[j] <- sum(resid(regress)^2)
        }
      }
    }
    else{
      regress <- try(suppressWarnings(nlsLM(share ~ (lc.gb2(c(A, 1, P, 1), cprob)), algorithm ="port",
        start = list(A = free.p$root, P = grid[j]), lower = c(0, 0), control = nls.lm.control(maxiter = 1000))),
        silent = TRUE)
      if(!'try-error'%in%class(regress)){
        par.a[j] <- coef(regress)[1]
        par.p[j] <- coef(regress)[2]
        rss[j] <- sum(resid(regress)^2)
      }
    }
  }
  rg <- which.min(rss)
  nls.rss <- min(rss)
  nls.a <- par.a[rg]
  nls.p <- par.p[rg]
  if(1 <= 1 / nls.a) {
    temp.b <- NA
    print("Unable to compute the scale parameter and the OMD estimation. Condition for the existence of the first moment violated: 1 <= 1 / a")
  }
  if(is.null(pc.inc)) {
    temp.b <- NA
    print("Unable to compute the scale parameter and the OMD estimation. Per capita GDP not provided")
  }
  if(!is.null(pc.inc) & (1 > 1 / nls.a)) {
    incpc <- pc.inc / rescale
    temp.b <- scale.gb2(c(nls.a, nls.p, 1), incpc)
  }
  if(1 <= 2 / nls.a){
    print("Unable to compute OMD estimates of the parameters. Condition for the existence of the second moment violated: 1 <= 2 / a")
  }
  if(1 <= 2 / nls.a | is.na(temp.b)) {
    gmm.coef <- matrix(NA, 1, 3)
    gmm.se <- matrix(NA, 1, 3)
    gmm.rss <- NA
  }
  else{
    regress <- try(opt.gmm.da(cprob, share, init.est = c(nls.a, temp.b, nls.p), cons.est = c(nls.a, temp.b, nls.p)))
    if('try-error'%in%class(regress$opt1)) {
      print("Unable to compute OMD estimates of the parameters. The weight martrix cannot be inverted. Try changing the value of rescale")
      gmm.coef <- matrix(NA, 1, 3)
      gmm.se <- matrix(NA, 1, 3)
      gmm.rss <- NA
    }
    else {
      gmm.rss <- regress$opt1$value
      gmm.coef <- regress$opt1$par
      gmm.coef[2] <- scale.gb2(c(gmm.coef[1], gmm.coef[3], 1), pc.inc)
      gmm.se <- matrix(NA, 1, 3)
    }
  }
  nls.estimation <- matrix(NA, 2, 3)
  nls.coef <- matrix(c(nls.a, temp.b * rescale, nls.p), 1, 3)
  nls.se <- matrix(NA, 1, 3)

  if (se.ewmd == TRUE) {
    if (is.null(N)) {
      print("Unable to compute the standard errors. Please provide the sample size")
    }
    else {
      se.calc <- simsd.da(x = x, theta = nls.coef, N = N, nrep = nrep, se.scale = se.scale, factor.r = rescale)
      nls.se <- se.calc$nls.se
      if(!is.na(temp.b)) {
        if(se.omd == TRUE){
          gmm.se <- gmmse.da(gmm.coef, cprob, N)
        }
        if(se.scale == TRUE){
          gmm.se[2] <- se.calc$sd.scale
        }
      }
    }
  }
  if (se.ewmd == FALSE & se.omd == TRUE) {
    if (is.null(N)) {
      print("Unable to compute the standard errors. Please provide the sample size")
    }
    else {
      if(!is.na(temp.b)) {
        gmm.se <- gmmse.da(gmm.coef, cprob, N)
        if(se.scale == TRUE){
          gmm.se[2] <- simsd.da(x = x, theta = nls.coef, N = N, nrep = nrep, se.scale = se.scale, factor.r = rescale)$sd.scale
        }
      }
    }
  }
  nls.estimation[1, ] <- nls.coef
  nls.estimation[2, ] <- nls.se
  colnames(nls.estimation) <- c("a", "b", "p")
  row.names(nls.estimation) <- c("Coef.", "se")

  gmm.estimation <- matrix(NA, 2, 3)
  gmm.estimation[1, ] <- gmm.coef
  gmm.estimation[2, ] <- gmm.se
  colnames(gmm.estimation) <- c("a", "b", "p")
  row.names(gmm.estimation) <- c("Coef.", "se")

  grouped.data <- rbind(share, cprob)
  row.names(grouped.data) <- c("Income", "Population")

  if (gini == TRUE) {
    if (!is.na(gmm.rss)) {
      gmm.gini <- simgini.da(gmm.coef)
    }
    else {gmm.gini <- NA}
    nls.gini <- simgini.da(nls.coef)
    gini.estimation <- matrix(NA, 1, 3)
    colnames(gini.estimation) <- c("Survey", "EWMD estimate", "OMD estimate")
    gini.estimation[1] <- gini.e
    gini.estimation[2] <- nls.gini
    gini.estimation[3] <- gmm.gini
    out2 <- list(grouped.data = grouped.data, distribution = "Dagum", ewmd.estimation = nls.estimation, ewmd.rss = nls.rss, omd.estimation = gmm.estimation, omd.rss = gmm.rss,
      gini.estimation = gini.estimation)
  }
  else {
    out2 <- list(grouped.data = grouped.data, distribution = "Dagum", ewmd.estimation = nls.estimation, ewmd.rss = nls.rss, omd.estimation = gmm.estimation, omd.rss = gmm.rss)
  }
  return(out2)
}

Try the GB2group package in your browser

Any scripts or data that you put into this service are public.

GB2group documentation built on Jan. 26, 2021, 5:06 p.m.