fgini: Gini index, variances and confidence intervals in finite...

fgini R Documentation

Gini index, variances and confidence intervals in finite populations

Description

Estimates the Gini index and computes variances and confidence intervals in finite populations.

Usage

fgini(
  y,
  w,
  method = 2L,
  interval = NULL,
  Pi = NULL,
  Pij = NULL,
  PiU,
  alpha = 0.05,
  B = 1000L,
  na.rm = TRUE,
  varformula = "SYG",
  large.sample = FALSE
)

Arguments

y

A vector with the non-negative real numbers to be used for estimating the Gini index.

w

A numeric vector with the survey weights to be used for estimating the Gini index, the variance and the confidence interval. This argument can be missing if argument Pi is provided.

method

An integer between 1 and 5 selecting one of the 5 methods detailed below for estimating the Gini index in finite populations. The default method is method = 2L.

interval

A character string specifying the type of variance estimation and confidence interval to be used. Possible values are "zjackknife", "zalinearization", "zblinearization" and "pbootstrap". interval = NULL omits the computation of both variance and confidence interval. The default value is interval = NULL.

Pi

A numeric vector with the (sample) first inclusion probabilities to be used for estimating the Gini index, the variance and the confidence interval. This argument can be NULL if argument w is provided. The default value is Pi = NULL.

Pij

A numeric square matrix with the (sample) second (joint) inclusion probabilities to be used for the variance estimation and the confidence interval. The Hajek approximation is used when Pij = NULL. This argument is only used when interval is one of "zjackknife", "zalinearization" or "zblinearization". The default value is Pij = NULL.

PiU

A numeric vector with the (population) first inclusion probabilities. This argument is only required when the Hartley-Rao expression for the variance estimation is selected (varformula = "HR").

alpha

A single numeric value between 0 and 1. If interval is not NULL, the confidence level to be used for computing the confidence interval for the Gini index is 1-alpha. Some authors call alpha the significance level. The default value is alpha = 0.05.

B

A single integer specifying the number of bootstrap replicates. This argument is required when interval = "pbootstrap". The default value is B = 1000L.

na.rm

A 'TRUE/FALSE' logical value indicating whether NA's should be removed before the computation proceeds. The default value is na.rm = TRUE.

varformula

A character string specifying the type of formula to be used for the variance estimator when interval is one of "zjackknife", "zalinearization" or "zblinearization". Possible values are "HT" (Horvitz-Thompson), "SYG" (Sen-Yates-Grundy) and "HR" (Hartley-Rao). The default value is varformula = "SYG".

large.sample

A 'TRUE/FALSE' logical value indicating whether the sample is large, in which case a faster algorithm is applied to sort the sample values in the computation of the Gini index. The default value is large.sample = FALSE.

Details

For a sample S, with size n and inclusion probabilities \pi_i=P(i\in S) (argument Pi), derived from a finite population U, with size N, different formulations of the Gini index have been proposed in the literature. This function estimates the Gini index, variances and confidence intervals using various formulations. The different methods for estimating the Gini index are (see also Muñoz et al., 2023):

Gini index formulae.

method = 1 (Langel and Tillé, 2013)

\widehat{G}_{w1}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,

where \widehat{N}=\sum_{i \in S}w_i, \overline{y}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}y_{i}, and w_i are the survey weights. For example, the survey weights can be w_i=\pi_{i}^{-1}. At least one of w or Pi must be provided; supplying both is not necessary. When both w and Pi are provided, it is required that w_i = \pi_i^{-1}, for i \in S.

method = 2 (Alfons and Templ, 2012; Langel and Tillé, 2013)

\widehat{G}_{w2} =\displaystyle \frac{2\sum_{i \in S}w_{(i)}^{+}\widehat{N}_{(i)}y_{(i)} - \sum_{i \in S}w_{i}^{2}y_{i} }{\widehat{N}^{2}\overline{y}_{w}}-1,

where y_{(i)} are the values y_i sorted in increasing order, w_{(i)}^{+} are the values w_i sorted according to the increasing order of the values y_i, and \widehat{N}_{(i)}=\sum_{j=1}^{i}w_{(j)}^{+}. Langel and Tillé (2013) show that \widehat{G}_{w1} = \widehat{G}_{w2}.
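The equality of the first two formulae can be checked numerically. The following is a minimal base R sketch, not the package's implementation: the helper names gini_w1 and gini_w2 are hypothetical and simply transcribe the two expressions above.

```r
# Hypothetical helpers (not part of giniVarCI): direct transcriptions of
# the method = 1 and method = 2 formulae.
gini_w1 <- function(y, w) {
  N_hat <- sum(w)                       # estimated population size
  y_bar <- sum(w * y) / N_hat           # weighted mean
  abs_diff <- abs(outer(y, y, "-"))     # |y_i - y_j| for all pairs
  sum(outer(w, w) * abs_diff) / (2 * N_hat^2 * y_bar)
}

gini_w2 <- function(y, w) {
  ord <- order(y)                       # increasing order of y
  y_s <- y[ord]
  w_s <- w[ord]                         # weights sorted along with y
  N_hat <- sum(w_s)
  y_bar <- sum(w_s * y_s) / N_hat
  N_cum <- cumsum(w_s)                  # cumulative weights N_(i)
  (2 * sum(w_s * N_cum * y_s) - sum(w_s^2 * y_s)) / (N_hat^2 * y_bar) - 1
}

y <- c(30428.83, 14976.54, 18094.09, 29476.79, 20381.93, 6876.17,
       10360.96, 8239.82, 29476.79, 32230.71)
w <- c(357.86, 480.99, 480.99, 476.01, 498.58, 498.58, 476, 498.58,
       476.01, 476.01)

# The two values should agree up to floating-point error.
all.equal(gini_w1(y, w), gini_w2(y, w))
```

The pairwise version needs O(n^2) memory, whereas the sorted version only needs one sort and one cumulative sum, which is one plausible reason for preferring the second form on larger samples.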

method = 3 (Berger, 2008)

\widehat{G}_{w3} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}^{\ast}(y_{i})-1,

where

\widehat{F}_{w}^{\ast}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}[\delta(y_i < t) + 0.5\delta(y_i = t)]

is the smooth (mid-point) distribution function, and \delta(\cdot) is the indicator variable that takes the value 1 when its argument is true, and the value 0 otherwise. It can be seen that \widehat{G}_{w2} = \widehat{G}_{w3}.
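As an illustration of the mid-point distribution function, the sketch below evaluates \widehat{F}_{w}^{\ast} at every sample value and plugs it into the method = 3 formula. It is a plain base R transcription under the definitions above; the names f_mid and gini_w3 are hypothetical and do not come from the package.

```r
# Hypothetical helpers (not part of giniVarCI).
# Mid-point distribution function evaluated at a single point t.
f_mid <- function(t, y, w) {
  sum(w * ((y < t) + 0.5 * (y == t))) / sum(w)
}

# Method = 3 estimator built on the mid-point distribution function.
gini_w3 <- function(y, w) {
  N_hat <- sum(w)
  y_bar <- sum(w * y) / N_hat
  F_star <- vapply(y, f_mid, numeric(1), y = y, w = w)
  2 * sum(w * y * F_star) / (N_hat * y_bar) - 1
}

# Small sample with a tied value, to exercise the 0.5 * (y == t) term.
y <- c(10, 20, 20, 40)
w <- c(2, 1, 3, 1)
gini_w3(y, w)
```

Ties are where the mid-point rule matters: each tied unit receives half of the total weight sitting at its own value, so the result does not depend on how tied observations happen to be ordered.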

method = 4 (Berger and Gedik-Balay, 2020)

\widehat{G}_{w4} = 1 - \displaystyle \frac{\overline{v}_{w}}{\overline{y}_{w}},

where \overline{v}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}v_{i} and

v_{i} = \displaystyle \frac{1}{\widehat{N} - w_{i}}\sum_{ \substack{j \in S\\ j\neq i}}\min(y_{i},y_{j}).

method = 5 (Lerman and Yitzhaki, 1989)

\widehat{G}_{w5} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}} \sum_{i \in S} w_{(i)}^{+}[y_{(i)} - \overline{y}_{w}]\left[ \widehat{F}_{w}^{LY}(y_{(i)}) - \overline{F}_{w}^{LY} \right],

where

\widehat{F}_{w}^{LY}(y_{(i)}) = \displaystyle \frac{1}{\widehat{N}}\left(\widehat{N}_{(i-1)} + \frac{w_{(i)}^{+}}{2} \right)

and \overline{F}_{w}^{LY}=\widehat{N}^{-1}\sum_{i \in S}w_{(i)}^{+}\widehat{F}_{w}^{LY}(y_{(i)}).
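The method = 5 expression has a covariance-like form, which can be sketched in base R as follows. This is only a transcription of the formula above under its stated definitions; gini_w5 is a hypothetical name, not a package function.

```r
# Hypothetical helper (not part of giniVarCI): method = 5 formula.
gini_w5 <- function(y, w) {
  ord <- order(y)
  y_s <- y[ord]
  w_s <- w[ord]
  N_hat <- sum(w_s)
  y_bar <- sum(w_s * y_s) / N_hat
  N_prev <- cumsum(w_s) - w_s            # N_(i-1), with N_(0) = 0
  F_ly <- (N_prev + w_s / 2) / N_hat     # F^LY at each sorted value
  F_bar <- sum(w_s * F_ly) / N_hat       # weighted mean of F^LY
  2 * sum(w_s * (y_s - y_bar) * (F_ly - F_bar)) / (N_hat * y_bar)
}

y <- c(30428.83, 14976.54, 18094.09, 29476.79, 20381.93, 6876.17,
       10360.96, 8239.82, 29476.79, 32230.71)
w <- c(357.86, 480.99, 480.99, 476.01, 498.58, 498.58, 476, 498.58,
       476.01, 476.01)
gini_w5(y, w)
```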

Variances and confidence intervals.

For a given estimator \widehat{G}_{w} and variable z, the Horvitz-Thompson type variance estimator (Horvitz and Thompson, 1952)

\widehat{V}_{HT}(\widehat{G}_{w}) = \displaystyle \sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}w_{i}w_{j}z_{i}z_{j}

is computed when varformula = "HT", where

\breve{\Delta}_{ij}=\displaystyle \frac{\pi_{ij}-\pi_{i}\pi_{j}}{\pi_{ij}}

and \pi_{ij} is the second (joint) inclusion probability of the individuals i and j, i.e., \pi_{ij}=P\{(i,j)\in S\} (argument Pij).

The Sen-Yates-Grundy type variance estimator (Sen, 1953; Yates and Grundy, 1953)

\widehat{V}_{SYG}(\widehat{G}_{w}) = - \displaystyle \frac{1}{2}\sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}(w_{i}z_i-w_{j}z_{j})^{2}

is computed when varformula = "SYG", and the Hartley-Rao type variance estimator (Hartley and Rao, 1962)

\widehat{V}_{HR}(\widehat{G}_{w}) = \displaystyle \frac{1}{n-1}\sum_{i\in S}\sum_{\substack{j \in S\\ j < i}}\left(1-\pi_i-\pi_j + \frac{1}{n}\sum_{k\in U}\pi_{k}^{2} \right)(w_{i}z_i-w_{j}z_{j})^{2}

is computed when varformula = "HR". Note that the Horvitz-Thompson variance estimator can give negative values. Both the Horvitz-Thompson and Sen-Yates-Grundy variance estimators depend on the second (joint) inclusion probabilities (argument Pij). The Hajek (1964) approximation

\pi_{ij}\cong \pi_{i}\pi_{j}\left[1- \displaystyle \frac{(1-\pi_{i})(1-\pi_{j})}{\sum_{k \in S}(1-\pi_{k})} \right]

is used when the second (joint) inclusion probabilities are not available (Pij = NULL). Note that the Hajek approximation is suggested for large-entropy sampling designs, large samples, and large populations (see Tillé, 2006; Berger and Tillé, 2009; Haziza et al., 2008; Berger, 2011). For instance, this approximation is not recommended for highly-stratified samples (Berger, 2005). The Hartley-Rao variance estimator requires the first inclusion probabilities at the population level (argument PiU).

interval = "zjackknife" computes the confidence interval based on the jackknife technique with critical values based on the Normal approximation. interval = "zalinearization" and interval = "zblinearization" compute the confidence intervals based on the linearization technique applied to the estimators

\widehat{G}_{w}^{a} = \widehat{G}_{w1}

and

\widehat{G}_{w}^{b} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}(y_{i})-1,

respectively, where

\widehat{F}_{w}(t)=\frac{1}{\widehat{N}}\sum_{i \in S}w_i\delta(y_i \leq t).

Critical values are also based on the Normal approximation. pbootstrap computes the variance using the rescaled bootstrap, and the confidence interval is constructed using the percentile method. The vignette vignette("GiniVarInterval") contains a detailed description of the various methods for variance estimation and confidence intervals for the Gini index.
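The Hajek approximation and the Sen-Yates-Grundy expression given earlier in this section can be sketched in base R. The helper names hajek_pij and syg_variance are hypothetical, and the vector z is treated as given: in practice it would be the variable produced by the jackknife or linearization step, which is not reproduced here. The example uses simple random sampling without replacement, where the exact joint probabilities are known, only as a sanity check.

```r
# Hypothetical helpers (not part of giniVarCI).
# Hajek approximation to the joint inclusion probabilities.
hajek_pij <- function(Pi) {
  d <- sum(1 - Pi)
  Pij <- outer(Pi, Pi) * (1 - outer(1 - Pi, 1 - Pi) / d)
  diag(Pij) <- Pi                        # pi_ii is pi_i by definition
  Pij
}

# Sen-Yates-Grundy type variance for a given variable z.
syg_variance <- function(z, w, Pi, Pij = hajek_pij(Pi)) {
  delta <- (Pij - outer(Pi, Pi)) / Pij   # breve Delta_ij
  diffs <- outer(w * z, w * z, "-")      # w_i z_i - w_j z_j
  -0.5 * sum(delta * diffs^2)
}

# Sanity check under simple random sampling without replacement,
# with n = 5 units drawn from N = 20 (illustrative numbers only).
n <- 5
N <- 20
Pi <- rep(n / N, n)
Pij_srs <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
diag(Pij_srs) <- Pi
z <- c(3, 1, 4, 1, 5)                    # placeholder values for z
w <- 1 / Pi

syg_variance(z, w, Pi, Pij_srs)          # exact joint probabilities
N^2 * (1 - n / N) * var(z) / n           # textbook variance of a total
syg_variance(z, w, Pi)                   # Hajek approximation instead
```

With the exact joint probabilities the first two values coincide; the third shows how much the Hajek approximation moves the result for such a small sample.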

The following table summarises the various types of variances and confidence intervals that the function fgini computes. The argument varformula only applies for the jackknife and linearization techniques (see Berger, 2008; Langel and Tillé, 2013).

Interval         Variance            Critical values       References
_______________  __________________  ____________________  _____________________________
zjackknife       Jackknife           Normal                Berger (2008)
zalinearization  Linearization       Normal                Langel and Tillé (2013)
zblinearization  Linearization       Normal                Berger (2008)
pbootstrap       Rescaled bootstrap  Percentile bootstrap  Berger and Gedik-Balay (2020)
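For the three rows with Normal critical values, the interval limits follow the usual point estimate plus or minus a Normal quantile times the standard error. The sketch below shows that construction for a given estimate and variance; normal_ci is a hypothetical helper, the input numbers are made up for illustration, and whether fgini additionally truncates the limits to [0, 1] is not stated here, so no truncation is assumed.

```r
# Hypothetical helper (not part of giniVarCI): Normal-approximation
# confidence interval at level 1 - alpha.
normal_ci <- function(gini_hat, var_hat, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)         # two-sided critical value
  half_width <- z_crit * sqrt(var_hat)
  c(lower = gini_hat - half_width, upper = gini_hat + half_width)
}

# Illustrative inputs only: an estimate of 0.30 with variance 0.0004.
normal_ci(0.30, 0.0004)
normal_ci(0.30, 0.0004, alpha = 0.10)    # narrower 90% interval
```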

Value

When interval = NULL, the function returns a single numeric value between 0 and 1 with the estimate of the Gini index. When interval is not NULL, the function returns a list with 3 components: a single numeric value with the estimate of the Gini index; a single numeric value with the estimated variance of the Gini index estimator; and a vector of length two containing the lower and upper limits of the confidence interval for the Gini index.

Author(s)

Juan F Munoz jfmunoz@ugr.es

Jose M Pavia pavia@uv.es

Encarnacion Alvarez encarniav@ugr.es

References

Alfons, A., and Templ, M. (2012). Estimation of social exclusion indicators from complex surveys: The R package laeken. KU Leuven, Faculty of Business and Economics Working Paper.

Berger, Y. G. (2005). Variance estimation with highly stratified sampling designs with unequal probabilities. Australian & New Zealand Journal of Statistics, 47, 365–373.

Berger, Y. G. (2008). A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient. Journal of Official Statistics, 24(4), 541-555.

Berger, Y. G. (2011). Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pakistan Journal of Statistics, 27, 407–426.

Berger, Y. G. and Tillé, Y. (2009). Sampling with unequal probabilities. In Sample Surveys: Design, Methods and Applications (eds. D. Pfeffermann and C. R. Rao), 39–54. Elsevier, Amsterdam.

Berger, Y., and Gedik-Balay, I. (2020). Confidence intervals of Gini coefficient under unequal probability sampling. Journal of Official Statistics, 36(2), 237-249.

Hajek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35, 4, 1491–1523.

Hartley, H. O., and Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics, 350-374.

Haziza, D., Mecatti, F. and Rao, J. N. K. (2008). Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron, LXVI, 91–108.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.

Langel, M., and Tillé, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 521-540.

Lerman, R. I., and Yitzhaki, S. (1989). Improving the accuracy of estimates of Gini coefficients. Journal of Econometrics, 42(1), 43-47.

Muñoz, J. F., Moya-Fernández, P. J., and Álvarez-Verdejo, E. (2023). Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality. Sociological Methods & Research. https://doi.org/10.1177/00491241231176847

Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5, 119–127.

Tillé, Y. (2006). Sampling Algorithms. Springer, New York.

Yates, F., and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society B, 15, 253–261.

See Also

fginindex, fcompareCI

Examples

# Income and weights (region 'Burgenland') from the 2006 Austrian EU-SILC (Package 'laeken').
data(eusilc, package="laeken")
y <- eusilc$eqIncome[eusilc$db040 == "Burgenland"]
w <- eusilc$rb050[eusilc$db040 == "Burgenland"]

# Estimation of the Gini index using the default 'method = 2'.
fgini(y, w)


y <- c(30428.83, 14976.54, 18094.09, 29476.79, 20381.93, 6876.17,
       10360.96, 8239.82, 29476.79, 32230.71)
w <- c(357.86, 480.99, 480.99, 476.01, 498.58, 498.58, 476, 498.58, 476.01, 476.01)

# Gini index estimation and confidence interval using:
 ## a: The method 2 for point estimation.
 ## b: The method 'zjackknife' for variance estimation.
 ## c: The Sen-Yates-Grundy type variance estimator.
 ## d: The Hajek approximation for the joint inclusion probabilities.
fgini(y, w, interval = "zjackknife")

# Gini index estimation and confidence interval using:
 ## a: The method 2 for point estimation.
 ## b: The method 'zalinearization' for variance estimation.
 ## c: The Sen-Yates-Grundy type variance estimator.
 ## d: The Hajek approximation for the joint inclusion probabilities.
fgini(y, w, interval = "zalinearization")

# Gini index estimation and confidence interval using:
 ## a: The method 3 for point estimation.
 ## b: The method 'zblinearization' for variance estimation.
 ## c: The Sen-Yates-Grundy type variance estimator.
 ## d: The Hajek approximation for the joint inclusion probabilities.
fgini(y, w, method = 3L, interval = "zblinearization")

# Gini index estimation and confidence interval using:
 ## a: The method 2 for point estimation.
 ## b: The method 'pbootstrap' for variance estimation.
 ## c: The percentile bootstrap method for the confidence interval.
fgini(y, w, interval = "pbootstrap")

giniVarCI documentation built on May 29, 2024, 3:36 a.m.