options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set(echo = TRUE)
For the most part, this document will present the functionalities of the function
surveysd::calc.stError()
which generates point estimates and standard errors for
user-supplied estimation functions.
In order to use a dataset with calc.stError()
, several weight columns have to be present. Each
weight column corresponds to a bootstrap sample. In the following examples, we will use the data
from demo.eusilc()
and attach the bootstrap weights using draw.bootstrap()
and recalib()
. Please
refer to the documentation of those functions for more detail.
library(surveysd) set.seed(1234) eusilc <- demo.eusilc(prettyNames = TRUE) dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight", strata = "region", period = "year") dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region", epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE) dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)] ## print part of the dataset dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
The parameters fun
and var
in calc.stError()
define the estimator to be used in the error
analysis. There are two built-in estimator functions weightedSum()
and weightedRatio()
which can
be used as follows.
povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio) totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)
Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.
povertyRate$Estimates totalIncome$Estimates
Columns that use the val_
prefix denote the point estimate belonging to the "main weight" of the
dataset, which is pWeight
in case of the dataset used here.
Columns with the stE_
prefix denote standard errors calculated with bootstrap replicates. The
replicates result in using w1
, w2
, ..., w10
instead of pWeight
when applying the estimator.
n
denotes the number of observations for the year and N
denotes the total weight of those
persons.
In order to define a custom estimator function to be used in fun
, the function needs to have
at least two arguments like the example below.
## define custom estimator myWeightedSum <- function(x, w) { sum(x*w) } ## check if results are equal to the one using `surveysd::weightedSum()` totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum) all.equal(totalIncome$Estimates, totalIncome2$Estimates)
The parameters x
and w
can be assumed to be vectors with equal length with w
being numeric
weight vector and x
being the column defined in the var
argument. It will be called once
for each period (in this case year
) and for each weight column (in this case pWeight
, w1
, w2
, ..., w10
).
Custom estimators using additional parameters can also be supplied and parameter add.arg
can be used
to set the additional arguments for the custom estimator.
## use add.arg-argument fun <- function(x, w, b) { sum(x*w*b) } add.arg = list(b="onePerson") err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun, period.mean = 0, add.arg=add.arg) err.est$Estimates # compare with direct computation compare.value <- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson), by=c("year")] all((compare.value$V1-err.est$Estimates$val_povertyRisk)==0)
The above chunk computes the weighted poverty ratio for single person households.
In our example the variable povertyRisk
is a boolean and is TRUE
if the income is less
than 60% of the weighted median income. Thus it directly depends on the original weight vector
pWeight
. To further reduce the estimated error one should calculate for each bootstrap replicate
weight $w$ the weighted median income $medIncome_{w}$ and then define $povertyRisk_w$ as
$$ povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\ 0 \quad\text{else}} $$
The estimator can then be applied to the new variable $povertyRisk_w$. This can be realized using a custom estimator function.
# custom estimator to first derive poverty threshold # and then estimate a weighted ratio povmd <- function(x, w) { md <- laeken::weightedMedian(x, w)*0.6 pmd60 <- x < md # weighted ratio is directly estimated inside the function return(sum(w[pmd60])/sum(w)*100) } err.est <- calc.stError( dat_boot_calib, var = "povertyRisk", fun = weightedRatio, fun.adjust.var = povmd, adjust.var = "eqIncome") err.est$Estimates
The approach shown above is only valid if no grouping variables are supplied (parameter group = NULL
).
If grouping variables are supplied one should use parameters fun.adjust.var
and adjust.var
such that
the $povertyRisk_w$ is first calculated for each period
and then used for each grouping in group
.
# using fun.adjust.var and adjust.var to estimate povmd60 indicator # for each period and bootstrap weight before applying the weightedRatio povmd2 <- function(x, w) { md <- laeken::weightedMedian(x, w)*0.6 pmd60 <- x < md return(as.integer(pmd60)) } # set adjust.var="eqIncome" so the income vector is used to estimate # the povmd60 indicator for each bootstrap weight # and the resulting indicators are passed to function weightedRatio group <- "gender" err.est <- calc.stError( dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = "gender", fun.adjust.var = povmd2, adjust.var = "eqIncome") err.est$Estimates
In case an estimator should be applied to several columns of the dataset, var
can be set to
a vector containing all necessary columns.
multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio) multipleRates$Estimates
Here we see the relative number of persons at risk of poverty and the relative number of one-person households.
The groups
argument can be used to calculate estimators for different subsets of the data. This
argument can take the grouping variable as a string that refers to a column name (usually a factor)
in dat
. If set, all estimators are not only split by the reference period but also by the
grouping variable. For simplicity, only one reference period of the above data is used.
dat2 <- subset(dat_boot_calib, year == 2010) for (att in c("period", "weights", "b.rep")) attr(dat2, att) <- attr(dat_boot_calib, att)
To calculate the ratio of persons at risk of poverty for each federal state of Austria,
group = "region"
can be used.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region") povertyRates$Estimates
The last row with region = NA
denotes the aggregate over all regions. Note that the
columns N
and n
now show the weighted and unweighted number of persons in each region.
In case more than one grouping variable is used, there are several options of calling
calc.stError()
depending on whether combinations of grouping levels should be regarded or not.
We will consider the variables gender
and region
as our grouping variables and show three
options on how calc.stError()
can be called.
Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore
$$n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.$$
The last row is again the estimate for the whole period.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = c("gender", "region")) povertyRates$Estimates
region
and gender
Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size
$$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.$$
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = list(c("gender", "region"))) povertyRates$Estimates
In this case, the estimates and standard errors are calculated for
The number of rows in the output is therefore
$$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.$$
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = list("gender", "region", c("gender", "region"))) povertyRates$Estimates
If differences between groups need to be calculated, e.g difference of poverty rates between gender = "male"
and gender = "female"
, parameter group.diff
can be utilised.
Setting group.diff = TRUE
the differences and the standard error of these differences for all variables defined in groups
will be calculated.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = c("gender", "region"), group.diff = TRUE) povertyRates$Estimates
The resulting output table contains r nrow(povertyRates$Estimates)
rows. r sum(povertyRates$Estimates$estimate_type == "direct")
rows for all the direct estimators
$$n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12,$$
and another r sum(povertyRates$Estimates$estimate_type == "group difference")
for all the differences within the variable "gender"
and "region"
seperately.
Variable "gender"
has 2 unique values (unique(dat2$gender)
) resulting in 1 difference, ~ gender = "male"
- gender = "female"
and variable "region"
has
9 unique values (unique(dat2$region)
) resulting in
$$8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = \sum\limits_{1=1}^{9-1}i = 36$$
estimates. Thus the output contains 1 + 36 = 37 estimates with respect to group differences.
If a combintaion of grouping variables is used in group
and group.diff = TRUE
then differences between combinations will only be calculated
if one of the grouping variables differs.
For example the difference between the following groups would be calculated
gender = "female" & region = "Vienna"
- gender = "male" & region = "Vienna"
gender = "female" & region = "Vienna"
- gender = "female" & region = "Salzburg"
gender = "male" & region = "Salzburg"
- gender = "female" & region = "Salzburg"
The difference between gender = "female" & region = "Vienna"
and gender = "male" & region = "Salzburg"
however would not be calculated.
Thus this leads to
$$2\cdot(\sum\limits_{1=1}^{9-1}i) + 9\cdot1 = 81$$
results with respect to the differences.
The Output contains an additional column estimate_type
and
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = list(c("gender", "region")), group.diff = TRUE) povertyRates$Estimates[,.N,by=.(estimate_type)]
Differences of estimates between period
s can be calculated using parameter period.diff
.
period.diff
expects a character vector (if not NULL
) specifying for which period
s the differences should be calcualed for.
The inputs should be specified in the form "period2" - "period1"
.
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, period.diff = c("2017 - 2016", "2016 - 2015", "2015 - 2014")) povertyRates$Estimates
If additional grouping variables are supplied to calc.stError()
die differences across period
s are also carried out
for all variables in group
.
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, group = "gender", period.diff = c("2017 - 2016", "2016 - 2015", "2015 - 2014")) povertyRates$Estimates
With parameter period.mean
averages across period
s are calculated additional. The parameter accepts only odd integer values.
The resulting table will contain the direct estimates as well as rolling averages of length period.mean
.
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, period.mean = 3) povertyRates$Estimates
if in addition the parameters group
and/or period.diff
are specified then differences and groupings of averages will be calculated.
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, period.mean = 3, period.diff = "2016 - 2015", group = "gender") povertyRates$Estimates
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.