svystatQ: Estimation of Quantiles in Subpopulations
In DiegoZardetto/ReGenesees: R Evolved Generalized Software for Sampling Estimates and Errors in Surveys

Description

Calculates estimates, standard errors and confidence intervals for Quantiles of numeric variables in subpopulations.

Usage

svystatQ(design, y, probs = c(0.25, 0.5, 0.75), by = NULL,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.lev = 0.95, na.rm = FALSE,
         ties=c("discrete", "rounded"))

## S3 method for class 'svystatQ'
coef(object, ...)
## S3 method for class 'svystatQ'
SE(object, ...)
## S3 method for class 'svystatQ'
VAR(object, ...)
## S3 method for class 'svystatQ'
cv(object, ...)
## S3 method for class 'svystatQ'
confint(object, ...)

Arguments

`design`	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata.
`y`	Formula defining the interest variable.
`probs`	Vector of probability values used to calculate the quantile estimates. The default value selects estimates of quartiles.
`by`	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) estimates refer to the whole population.
`vartype`	`character` vector specifying the desired variability estimators. It is possible to choose one or more of: standard error (`'se'`, the default), coefficient of variation (`'cv'`), percent coefficient of variation (`'cvpct'`), or variance (`'var'`).
`conf.lev`	Probability specifying the desired confidence level: the default value is `0.95`.
`na.rm`	Should missing values (if any) be removed from the variable of interest? The default is `FALSE` (see ‘Details’).
`ties`	How should duplicated observed values be treated? Select `'discrete'` for a genuinely discrete interest variable and `'rounded'` for a continuous one.
`object`	An object of class `svystatQ`.
`...`	Additional arguments to `coef`, ..., `confint` methods (if any).

Details

This function calculates weighted estimates for the Quantiles of a quantitative variable using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise.

Standard errors are calculated using the so-called "Woodruff method" [Woodruff 52][Sarndal, Swensson, Wretman 92]: (i) first a confidence interval (at a given confidence level 1-a) is constructed for the relative frequency of units with values below the estimated quantile, (ii) then the inverse of the estimated cumulative relative frequency distribution (ECDF) is used to map this interval to a confidence interval for the quantile, (iii) lastly the desired standard error is estimated by dividing the length of the obtained confidence interval by the value 2*qnorm(1-a/2). Notice that the procedure above builds, in general, asymmetric confidence intervals around the estimated quantiles.
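The three steps above can be sketched in base R. This is only an illustration, not the code used by svystatQ: the helper names inv_ecdf and woodruff_sketch are hypothetical, and the standard error of the relative frequency is taken from a simple-random-sampling formula, which is an assumption made to keep the sketch self-contained (a design-based variance estimate would take its place in practice).

```r
# Inverse of the weighted ECDF: smallest observed value whose cumulative
# weight share reaches p (ties treated as for a discrete variable).
inv_ecdf <- function(y, w, p) {
  ord <- order(y)
  cum <- cumsum(w[ord]) / sum(w)
  idx <- which(cum >= p)[1]
  if (is.na(idx)) idx <- length(cum)  # guard against rounding when p is 1
  y[ord][idx]
}

# Hypothetical helper illustrating steps (i)-(iii) of the Woodruff method.
woodruff_sketch <- function(y, w, p, conf.lev = 0.95) {
  a <- 1 - conf.lev
  z <- qnorm(1 - a / 2)
  q_hat <- inv_ecdf(y, w, p)
  # (i) Confidence interval for the relative frequency below the quantile.
  #     ASSUMPTION: simple-random-sampling standard error, for illustration.
  se_p <- sqrt(p * (1 - p) / length(y))
  p_lo <- max(p - z * se_p, 0)
  p_hi <- min(p + z * se_p, 1)
  # (ii) Map the interval through the inverse ECDF.
  ci <- c(inv_ecdf(y, w, p_lo), inv_ecdf(y, w, p_hi))
  # (iii) Standard error = interval length / (2 * qnorm(1 - a/2)).
  c(estimate = q_hat, se = diff(ci) / (2 * z), lower = ci[1], upper = ci[2])
}

woodruff_sketch(y = 1:100, w = rep(1, 100), p = 0.5)
```

In this toy call the interval bounds do not sit at equal distances from the estimated median, which is the asymmetry mentioned above.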

The mandatory argument y identifies the variable of interest, that is the variable for which estimates of quantiles have to be calculated. The design variable referenced by y must be numeric.

The optional argument probs specifies the probability values (0.001<=probs[i]<=0.999) corresponding to the quantiles one wants to estimate; the default option selects quartiles.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatQ refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatQ twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.
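The difference between crossed domains and separate domains can be illustrated with plain base R factors. The variables B1, B2 and y below are made-up toy data, and the unweighted median stands in for a survey estimate; the snippet only shows which subpopulations each choice defines, not how svystatQ computes its results.

```r
# Hypothetical classification variables and interest variable (toy data).
B1 <- factor(c("a", "a", "b", "b", "a", "b"))
B2 <- factor(c("x", "y", "x", "y", "x", "y"))
y  <- c(10, 20, 30, 40, 50, 60)

# Crossing: one domain per observed combination of levels,
# analogous to the domains selected by by = ~B1:B2.
crossed <- interaction(B1, B2, sep = ":", drop = TRUE)
tapply(y, crossed, median)

# Separate domains: one summary per variable,
# analogous to two distinct calls with by = ~B1 and by = ~B2.
tapply(y, B1, median)
tapply(y, B2, median)
```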

The conf.lev argument sets the confidence level of the confidence intervals computed for the estimates. Its value must represent a probability (0<=conf.lev<=1) and defaults to 0.95.

Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE, observations containing NAs are dropped, and estimates are computed on non-missing values only. This implicitly assumes that values are missing completely at random: should this not be the case, computed estimates would be biased.

Argument ties addresses the problem of how to treat duplicated observed values (if any) when computing the ECDF. Option 'discrete' (the default) is appropriate when the variable of interest is genuinely discrete, while 'rounded' is a better choice for a continuous variable, i.e. when duplicates stem from rounding. In the first case the ECDF will show a vertical step corresponding to a duplicated value, in the second a smoother shape will be achieved by linear interpolation.
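The two treatments of ties can be contrasted on a toy sample in base R. This is a sketch of the two ideas only (a step-function inverse versus linear interpolation between the ECDF points); the helper names q_step and q_interp are hypothetical, and the interpolation rule used inside svystatQ may differ in detail.

```r
# Toy data: the value 2 is observed three times.
y <- c(1, 2, 2, 2, 3)
w <- rep(1, length(y))

# Weighted ECDF evaluated at each distinct observed value.
vals <- sort(unique(y))
cdf  <- as.numeric(cumsum(tapply(w, y, sum)) / sum(w))

# 'discrete' idea: the ECDF jumps at the duplicated value, so the
# quantile is the smallest value whose ECDF reaches p.
q_step <- function(p) vals[which(cdf >= p)[1]]

# 'rounded' idea: interpolate linearly between ECDF points, which
# smooths out the vertical step caused by duplicates.
q_interp <- function(p) approx(x = cdf, y = vals, xout = p, rule = 2)$y

q_step(0.5)
q_interp(0.5)
```

With these data the step version returns the duplicated value itself, while the interpolated version returns a point between the two neighbouring distinct values.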

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

Author(s)

Diego Zardetto

References

Woodruff, R.S. (1952) “Confidence Intervals for Medians and Other Position Measures”, Journal of the American Statistical Association, Vol. 47, No. 260, pp. 635-646.

Särndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer-Verlag.

Examples

# Load the package and create a design object:
library(ReGenesees)
data(data.examples)
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
     weights=~weight)

# Estimate of the deciles of the income variable for
# the whole population:
svystatQ(des,~income,probs=seq(0.1,0.9,0.1),ties="rounded")


# Another design object:
data(sbs)
des<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
     fpc=~fpc)

# Estimation of the median value added 
# for economic activity macro-sectors:
svystatQ(des,~va.imp2,probs=0.5,by=~nace.macro,
         ties="rounded",vartype="cvpct")

# Estimation of the Interquartile Range (IQR) of the number
# of employees for economic activity macro-sectors:
apply(svystatQ(des,~emp.num,probs=c(0.25,0.75),by=~nace.macro)[,2:3],1,diff)

DiegoZardetto/ReGenesees documentation built on Dec. 16, 2024, 2:03 p.m.