Nothing
#' @title Calcualte point estimates and their standard errors using bootstrap
#' weights.
#'
#' @description
#' Calculate point estimates as well as standard errors of variables in surveys.
#' Standard errors are estimated using bootstrap weights (see [draw.bootstrap]
#' and [recalib]). In addition the standard error of an estimate can be
#' calcualted using the survey data for 3 or more consecutive periods, which
#' results in a reduction of the standard error.
#'
#' @param dat either data.frame or data.table containing the survey data.
#' Surveys can be a panel survey or rotating panel survey, but does not need
#' to be. For rotating panel survey bootstrap weights can be created using
#' [draw.bootstrap] and [recalib].
#' @param weights character specifying the name of the column in `dat`
#' containing the original sample weights. Used to calculate point estimates.
#' @param b.weights character vector specifying the names of the columns in
#' `dat` containing bootstrap weights. Used to calculate standard errors.
#' @param period character specifying the name of the column in `dat`
#' containing the sample periods.
#' @param var character vector containing variable names in `dat` on which `fun`
#' shall be applied for each sample period. If `var = NULL` the results will
#' reflect the sum of `weights`.
#' @param fun function which will be applied on `var` for each sample period.
#' Predefined functions are [weightedRatio], [weightedSum], but can also take
#' any other function which returns a double or integer and uses weights as
#' its second argument.
#' @param relative.share boolean, if `TRUE` point estimates resulting from `fun` will be
#' divided by the point estimate at population level per `period`.
#' @param group character vectors or list of character vectors containig
#' variables in `dat`. For each list entry `dat` will be split in subgroups
#' according to the containing variables as well as `period`. The
#' pointestimates are then estimated for each subgroup seperately. If
#' `group=NULL` the data will split into sample periods by default.
#' @param group.diff boolen, if `TRUE` differences and the standard error
#' between groups defined in `group` are calculated. See details for more explanations.
#' @param fun.adjust.var can be either `NULL` or a function. This argument can
#' be used to apply a function for each `period` and bootstrap weight to the
#' data. The resulting estimates will be passed down to `fun`. See details for
#' more explanations.
#' @param adjust.var can be either `NULL` or a character specifying the first
#' argument in `fun.adjust.var`.
#' @param period.diff character vectors, defining periods for which the
#' differences in the point estimate as well it's standard error is
#' calculated. Each entry must have the form of `"period1 - period2"`. Can be
#' NULL
#' @param period.mean odd integer, defining the range of periods over which the
#' sample mean of point estimates is additionally calcualted.
#' @param bias boolean, if `TRUE` the sample mean over the point estimates of
#' the bootstrap weights is returned.
#' @param size.limit integer defining a lower bound on the number of
#' observations on `dat` in each group defined by `period` and the entries in
#' `group`. Warnings are returned if the number of observations in a subgroup
#' falls below `size.limit`. In addition the concerned groups are available in
#' the function output.
#' @param cv.limit non-negativ value defining a upper bound for the standard
#' error in relation to the point estimate. If this relation exceed
#' `cv.limit`, for a point estimate, they are flagged and available in the
#' function output.
#' @param p numeric vector containing values between 0 and 1. Defines which
#' quantiles for the distribution of `var` are additionally estimated.
#' @param add.arg additional arguments which will be passed to fun. Can be
#' either a named list or vector. The names of the object correspond to the
#' function arguments and the values to column names in dat, see also
#' examples.
#' @param national DEPRECATED use `relative.share` instead! boolean, if TRUE point estimates resulting from fun will be
#' divided by the point estimate at the national level.
#' @param new_method used for testing new implementation; will be removed with new release.
#' @details `calc.stError` takes survey data (`dat`) and returns point estimates
#' as well as their standard Errors defined by `fun` and `var` for each sample
#' period in `dat`. `dat` must be household data where household members
#' correspond to multiple rows with the same household identifier. The data
#' should at least contain the following columns:
#'
#' * Column indicating the sample period;
#' * Column indicating the household ID;
#' * Column containing the household sample weights;
#' * Columns which contain the bootstrap weights (see output of [recalib]);
#' * Columns listed in `var` as well as in `group`
#'
#' For each variable in `var` as well as sample period the function `fun` is
#' applied using the original as well as the bootstrap sample weights.\cr
#' The point estimate is then selected as the result of `fun` when using the
#' original sample weights and it's standard error is estimated with the result
#' of `fun` using the bootstrap sample weights. \cr
#' \cr
#' `fun` can be any function which returns a double or integer and uses sample
#' weights as it's second argument. The predifined options are `weightedRatio`
#' and `weightedSum`.\cr
#' \cr
#' For the option `weightedRatio` a weighted ratio (in \%) of `var` is
#' calculated for `var` equal to 1, e.g
#' `sum(weight[var==1])/sum(weight[!is.na(var)])*100`.\cr
#' Additionally using the option `national=TRUE` the weighted ratio (in \%) is
#' divided by the weighted ratio at the national level for each `period`.
#' \cr
#' If `group` is not `NULL` but a vector of variables from `dat` then `fun` is
#' applied on each subset of `dat` defined by all combinations of values in
#' `group`.\cr
#' For instance if `group = "sex"` with "sex" having the values "Male" and
#' "Female" in `dat` the point estimate and standard error is calculated on the
#' subsets of `dat` with only "Male" or "Female" value for "sex". This is done
#' for each value of `period`. For variables in `group` which have `NA`s in
#' `dat` the rows containing the missings will be discarded. \cr
#' When `group` is a list of character vectors, subsets of `dat` and the
#' following estimation of the point estimate, including the estimate for the
#' standard error, are calculated for each list entry.\cr
#' \cr
#' If `group.diff = TRUE` difference between groups definded by `group` are calculated.
#' Differences are only calculated within each variables of `group`,
#' e.g `group = c("gender", "region")` will calcualate estimates of each group and
#' also differences within `"gender"` and `"region"` seperately.
#' If grouping is done with multiple variables e.g `group = list(c("gender","region")`)
#' (~ grouping by `"gender"` x `"region"`) differences are calculated
#' only between groups where one of the grouping variables is different.
#' For instance the difference between `gender = "female" & region = "Vienna"` and
#' `gender = "male" & region = "Vienna"` OR `gender = "female" & region = "Vienna"` and
#' `gender = "female" & region = "Salzburg"` will be calculated.
#' The difference between `gender = "female" & region = "Vienna"` and
#' `gender = "male" & region = "Salzburg"` will not be calculated. The order of difference
#' is determined by order of value (alpha-numerical order) or
#' if grouping contains factor variables the factor levels determin the order.
#' \cr
#' The optional parameters `fun.adjust.var` and `adjust.var` can be used if the
#' values in `var` are dependent on the `weights`. As is for instance the case
#' for the poverty thershhold calculated from EU-SILC.
#' In such a case an additional function can be supplied using `fun.adjust.var`
#' as well as its first argument `adjust.var`, which needs to be part of the
#' data set `dat`. Then, before applying `fun` on variable `var`
#' for all `period` and groups, the function `fun.adjust.var` is applied to
#' `adjust.var` using each of the bootstrap weights seperately (NOTE: weight is
#' used as the second argument of `fun.adjust.var`).
#' Thus creating i=1,...,`length(b.weights)` additional variables.
#' For applying `fun` on `var` the estimates for the bootstrap replicate will
#' now use each of the corresponding new additional variables. So instead of
#' \deqn{fun(var,weights,...),fun(var,b.weights[1],...),
#' fun(var,b.weights[2],...),...}
#' the function `fun` will be applied in the way
#' \deqn{fun(var,weights,...),fun(var.1,b.weights[1],...),fun(var.2,
#' b.weights[2],...),...}
#'
#' where `var.1`, `var.2`, `...` correspond to the estimates resulting from
#' `fun.adjust.var` and `adjust.var`.
#' NOTE: This procedure is especially usefull if the `var` is dependent on
#' `weights` and `fun` is applied on subgroups of the data set. Then it is not
#' possible to capture this procedure with `fun` and `var`, see examples for a
#' more hands on explanation.
#' \cr
#' When defining `period.diff` the difference of point estimates between periods
#' as well their standard errors are calculated.\cr
#' The entries in `period.diff` must have the form of `"period1 - period2"`
#' which means that the results of the point estimates for `period2` will be
#' substracted from the results of the point estimates for `period1`.\cr
#' \cr
#' Specifying `period.mean` leads to an improvement in standard error by
#' averaging the results for the point estimates, using the bootstrap weights,
#' over `period.mean` periods.
#' Setting, for instance, `period.mean = 3` the results in averaging these
#' results over each consecutive set of 3 periods.\cr
#' Estimating the standard error over these averages gives an improved estimate
#' of the standard error for the central period, which was used for
#' averaging.\cr
#' The averaging of the results is also applied in differences of point
#' estimates. For instance defining `period.diff = "2015-2009"` and
#' `period.mean = 3`
#' the differences in point estimates of 2015 and 2009, 2016 and 2010 as well as
#' 2014 and 2008 are calcualated and finally the average over these 3
#' differences is calculated.
#' The periods set in `period.diff` are always used as the middle periods around
#' which the mean over `period.mean` years is build.
#' \cr
#' Setting `bias` to `TRUE` returns the calculation of a mean over the results
#' from the bootstrap replicates. In the output the corresponding columns is
#' labeled *_mean* at the end.\cr
#' \cr
#' If `fun` needs more arguments they can be supplied in `add.arg`. This can
#' either be a named list or vector.\cr
#' \cr
#' The parameter `size.limit` indicates a lower bound of the sample size for
#' subsets in `dat` created by `group`. If the sample size of a subset falls
#' below `size.limit` a warning will be displayed.\cr
#' In addition all subsets for which this is the case can be selected from the
#' output of `calc.stError` with `$smallGroups`.\cr
#' With the parameter `cv.limit` one can set an upper bound on the coefficient
#' of variantion. Estimates which exceed this bound are flagged with `TRUE` and
#' are available in the function output with `$cvHigh`.
#' `cv.limit` must be a positive integer and is treated internally as \%, e.g.
#' for `cv.limit=1` the estimate will be flagged if the coefficient of
#' variantion exceeds 1\%.\cr
#' \cr
#' When specifying `period.mean`, the decrease in standard error for choosing
#' this method is internally calcualted and a rough estimate for an implied
#' increase in sample size is available in the output with `$stEDecrease`.
#' The rough estimate for the increase in sample size uses the fact that for a
#' sample of size \eqn{n} the sample estimate for the standard error of most
#' point estimates converges with a factor \eqn{1/\sqrt{n}} against the true
#' standard error \eqn{\sigma}.
#'
#'
#' @return Returns a list containing:
#'
#' * `Estimates`: data.table containing period differences and/or k period
#' averages for estimates of
#' `fun` applied to `var` as well as the corresponding standard errors, which
#' are calculated using the bootstrap weights. In addition the sample size,
#' `n`, and poplutaion size for each group is added to the output.
#' * `smallGroups`: data.table containing groups for which the number of
#' observation falls below `size.limit`.
#' * `cvHigh`: data.table containing a boolean variable which indicates for each
#' estimate if the estimated standard error exceeds `cv.limit`.
#' * `stEDecrease`: data.table indicating for each estimate the theoretical
#' increase in sample size which is gained when averaging over k periods. Only
#' returned if `period.mean` is not `NULL`.
#'
#' @seealso [draw.bootstrap] \cr
#' [recalib]
#' @keywords survey manip
#' @author Johannes Gussenbauer, Alexander Kowarik, Statistics Austria
#'
#' @examples
#' # Import data and calibrate
#'
#' library(surveysd)
#' library(data.table)
#' setDTthreads(1)
#' set.seed(1234)
#' eusilc <- demo.eusilc(n = 4,prettyNames = TRUE)
#' dat_boot <- draw.bootstrap(eusilc, REP = 3, hid = "hid", weights = "pWeight",
#' strata = "region", period = "year")
#' dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region")
#'
#' # estimate weightedRatio for povertyRisk per period
#'
#' err.est <- calc.stError(dat_boot_calib, var = "povertyRisk",
#' fun = weightedRatio)
#' err.est$Estimates
#'
#' # calculate weightedRatio for povertyRisk and fraction of one-person
#' # households per period
#'
#' dat_boot_calib[, onePerson := .N == 1, by = .(year, hid)]
#' err.est <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"),
#' fun = weightedRatio)
#' err.est$Estimates
#'
#'
#' # estimate weightedRatio for povertyRisk per period and gender and
#' # period x region x gender
#'
#' group <- list("gender", c("gender", "region"))
#' err.est <- calc.stError(dat_boot_calib, var = "povertyRisk",
#' fun = weightedRatio, group = group)
#' err.est$Estimates
#'
#' # use average over 3 periods for standard error estimation
#' # and calculate estimate for difference of
#' # period 2011 and 2012 inclulding standard errors
#' period.diff <- c("2012-2011")
#' err.est <- calc.stError(
#' dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
#' period.diff = period.diff, # <- take difference of periods 2012 and 2011
#' period.mean = 3) # <- average over 3 periods
#' err.est$Estimates
#'
#' # for more examples see https://statistikat.github.io/surveysd/articles/error_estimation.html
#'
#' @export calc.stError
#'
# wrapper-function to apply fun to var using weights (~weights, b.weights)
# and calculating standard devation (using the bootstrap replicates) per period
# and for k-period rolling means
calc.stError <- function(
dat, weights = attr(dat, "weights"), b.weights = attr(dat, "b.rep"),
period = attr(dat, "period"), var = NULL, fun = weightedRatio,
relative.share = FALSE,
group = NULL, group.diff = FALSE, fun.adjust.var = NULL, adjust.var = NULL,
period.diff = NULL, period.mean = NULL, bias = FALSE,
size.limit = 20, cv.limit = 10, p = NULL,
add.arg = NULL, national = FALSE, new_method = TRUE) {
stE_high <- stE <- val <- as.formula <- est_type <- n_inc <-
stE_roll <- n <- size <- est <- estimate_type <- NULL
##########################################################
# INPUT CHECKING
if (is.data.frame(dat)) {
dat <- as.data.table(dat)
}else if (!is.data.table(dat)) {
stop("dat must be a data.frame or data.table")
}
c.names <- colnames(dat)
# check weights
check.input(weights, "weights", input.length = 1, input.type = "character",
c.names = c.names, dat = dat, dat.column.type = "numeric")
# check b.weights
check.input(b.weights, "b.weights", input.type = "character",
c.names = c.names, dat = dat, dat.column.type = "numeric")
if (any(!grepl("^[[:alpha:]]", b.weights)))
stop("Column names of bootstrap replicates must start with alphabetic ",
"character")
# check period
removeCols <- c()
periodNULL <- is.null(period)
if (periodNULL) {
period <- generateRandomName(20, colnames(dat))
dat[, c(period) := 1]
removeCols <- c(removeCols, period)
}
check.input(period, period, input.length = 1, input.type = "character",
c.names = c.names)
# check var
if(is.null(var)){
var <- generateRandomName(20, colnames(dat))
dat[, c(var) := 1]
removeCols <- c(removeCols, var)
}
check.input(var, "var", c.names = c.names, dat = dat, dat.column.type = c("numeric","logical"))
# check national/realtive.share
check.input(relative.share, "relative.share", input.length = 1, input.type = "logical")
check.input(national, "national", input.length = 1, input.type = "logical")
if(national == TRUE){
warning("Parameter 'national' will be deprecated soon use paremter 'relative.share' instead!")
relative.share <- national
}
# check fun and add.arg
check.input(fun, "fun", input.type = "function")
if (!is.null(add.arg)) {
if (!is.list(add.arg) | !is.vector(add.arg))
stop("add.arg needs to be a list or vector")
if (length(names(add.arg)) == 0)
stop("add.arg needs to be a named list or vector, names(add.arg) <- ?")
if (any(!names(add.arg) %in% formalArgs(fun))) {
notInFun <- !names(add.arg) %in% formalArgs(fun)
stop(paste(names(add.arg)[notInFun], collapse = " "),
" not argument(s) of supplied function.")
}
if (any(!unlist(add.arg) %in% c.names)) {
notInData <- unlist(add.arg)
notInData <- notInData[!notInData %in% c.names]
stop(paste(notInData, collapse = " "), " not in column names of dat.")
}
add.arg <- as.list(add.arg)
}
fun_names <- names(formals(fun))
eval_args <- c(list(var[1], weights), add.arg) # c(list(fun = fun, var = var, weights = weights), add.arg)
names(eval_args)[1:2] <- fun_names[1:2]
test.val <- dat[,do.call(fun, args = .SD), env = list(.SD = eval_args)]
if (!is.numeric(test.val) & !is.integer(test.val))
stop("Function in fun does not return integer or numeric value")
if (length(test.val) > 1)
stop("Function in fun does return more than one value. Only functions ",
"which return a single value are allowed.")
# check fun.adjust.var
if (!is.null(fun.adjust.var) & !is.null(adjust.var)) {
check.input(fun.adjust.var, "fun.adjust.var", input.type = "function")
if(length(var)!=length(adjust.var)){
stop("If adjust.var is not NULL, adjust.var and var must have the same length!")
}
check.input(adjust.var, "adjust.var", input.length = length(var), input.type =
"character", c.names = c.names,
dat = dat, dat.column.type = c("numeric","logical"))
fun_names <- names(formals(fun.adjust.var))
eval_args <- list(adjust.var[1], weights) # c(list(fun = fun, var = var, weights = weights), add.arg)
names(eval_args)[1:2] <- fun_names[1:2]
test.val <- dat[,do.call(fun.adjust.var, args = .SD), env = list(.SD = eval_args)]
if (!is.numeric(test.val) & !is.integer(test.val))
stop("Function in fun.adjust.var does not return integer ",
"or numeric value")
}else{
if((is.null(fun.adjust.var) & !is.null(adjust.var))|
(!is.null(fun.adjust.var) & is.null(adjust.var))){
stop("adjust.var and fun.adjust.var must be a column in dat and a function OR both be NULL")
}
}
# check group
if (is.null(group))
group <- list(NULL)
if (!is.list(group))
group <- as.list(group)
if (!any(unlist(lapply(group, is.null))))
group <- c(list(NULL), group)
if (any(!unlist(group) %in% c.names))
stop("Not all elements on group are column names in dat")
# check period.mean
if (!is.null(period.mean)) {
check.input(period.mean, "period.mean", input.length = 1, input.type = "numeric")
if (period.mean %% 2 == 0 & period.mean > 0) {
warning("period.mean must be odd - average over periods",
" will not be calculated")
period.mean <- NULL
} else {
if (period.mean <= 1)
period.mean <- NULL
}
}
# check bias
check.input(bias, "bias", input.length = 1, input.type = "logical")
# check size.limit
check.input(size.limit, "size.limit", input.length = 1, input.type = "numeric")
# check cv.limit
check.input(cv.limit, "cv.limit", input.length = 1, input.type = "numeric")
# check p
if (!is.null(p)) {
check.input(p, "p", input.type = "numeric")
if (any(!p %between% c(0, 1)))
stop("Values in p must be between 0 and 1")
}
# check period.diff
if (!is.null(period.diff)) {
periods.dat <- dat[,unique(period), env = list(period = period)]
period.diff <- strsplit(period.diff, "-")
period.diff <- lapply(period.diff, trimws)
rm.index <- rep(0, length(period.diff))
for (i in seq_along(period.diff)) {
if (any(!period.diff[[i]] %in% periods.dat)) {
warning("Removing ", paste(period.diff[[i]], collapse = "-"),
" from period.diff - period(s) not present in column ",
period, "\n")
rm.index[i] <- 1
}
}
if (all(rm.index == 1)) {
warning("No differences will be calculated\n")
period.diff <- NULL
} else {
period.diff <- period.diff[rm.index == 0]
}
}
##########################################################
##########################################################
# setup parameters
# define columns in which NAs are present (will be discarded for
# the evaluation)
col_cross <- unique(unlist(group))
if (!is.null(col_cross)) {
no.na <- unlist(dat[, lapply(.SD,
function(z) {
all(!is.na(z))
}),
.SDcols = col_cross])
no.na <- names(no.na)[!no.na]
if (length(no.na) > 0) {
print.no.na <- paste0(no.na, collapse = ", ")
warning("Missing values found in column name(s) ", print.no.na,
"\n Cells with missing values are discarded for the",
"calculation!\n")
}
} else {
no.na <- NULL
}
if (!is.null(p)) {
p.names <- paste0("p", p)
} else {
p.names <- NULL
}
# calculate point estimates
outx <- run.stError(
dat = dat, period = period, var = var, weights = weights,
b.weights = b.weights, fun = fun, relative.share = relative.share,
group = group, group.diff = group.diff,
fun.adjust.var = fun.adjust.var, adjust.var = adjust.var,
period.diff = period.diff, period.mean = period.mean, bias = bias,
no.na = no.na, size.limit = size.limit, p = p, add.arg = add.arg)
# remove columns
if (length(removeCols) > 0) {
dat[, c(removeCols) := NULL]
removeCols2 <- removeCols[removeCols %in% colnames(outx)]
outx[, c(removeCols2) := NULL]
if(var %in% removeCols){
outx[,est:="N"]
var <- "N"
}
}
outx[,estimate_type := fcase(est_type == "roll", "period average",
est_type == "diff", "period difference",
est_type == "diff_mean", "difference between period averages",
est_type == "diff_group", "group difference",
default = "direct")]
outx.names <- colnames(outx)
outx.names <- outx.names[!outx.names %in% c("val", "est_type", "stE", "mean",
"size", p.names)]
# get meta data like stE_high - size - increase in effektive sample size
# flag stE if values are especially high
outx[, stE_high := ((stE / val) * 100) > cv.limit]
# create bool matrix for stE_high
sd_bool <- subset(
outx,
select = unique(c("stE_high", outx.names[!outx.names %in% c("N", "n")]))
)
form <- as.formula(paste(
paste(outx.names[!outx.names %in% c("N", "n", "est")], collapse = "+"),
"est", sep = "~"))
sd_bool <- dcast(sd_bool, form, value.var = "stE_high")
# create matrix for increase of sample size
if (nrow(outx[est_type == "roll"]) > 0) {
# estimate (roughly) the effektive sample size per
samp_eff <- outx[est_type == "roll"]
samp_eff[,period := tstrsplit(period, "_", keep = 2), env = list(period = period)]
if (bias) {
samp_eff[, "mean" := NULL]
}
samp_eff[, c("size", "val", "est_type", "N", "n", "stE_high") := NULL]
setnames(samp_eff, "stE", "stE_roll")
same_names <- intersect(colnames(samp_eff), colnames(outx))
samp_eff <- merge(samp_eff, outx[, mget(c("stE", "n", same_names))],
by = same_names)
samp_eff[, n_inc := ((stE / stE_roll) ^ 2 - 1) * n]
samp_eff[, c("stE_roll", "stE", "n") := NULL]
} else {
samp_eff <- NULL
}
# create Matrix for groups which have small sizes
size_group <- unique(subset(
outx[size == TRUE],
select = c(outx.names[!outx.names %in% c("est", "N")])
))
val.var <- c("val", "stE", p.names)
if (bias) {
val.var <- c(val.var, "mean")
}
form_names <- outx.names[!outx.names %in% c("est")]
form <- as.formula(paste(
paste(form_names, collapse = "+"), "est", sep = "~"))
outx <- dcast(outx, form, value.var = val.var, fill = NA)
# reorder output
col.order <- as.vector(outer(paste0(val.var, "_"), var, FUN = "paste0"))
outx.names <- colnames(outx)
col.order <- c(outx.names[!outx.names %in% col.order], col.order)
col.order <- col.order[col.order %in% outx.names]
setcolorder(outx, col.order)
param <- list(
number.bweights = length(b.weights), period = period, var = var,
fun = fun, fun.adjust.var = fun.adjust.var, adjust.var = adjust.var,
group = group, group.diff = group.diff,
period.diff = period.diff, period.mean = period.mean,
bias = bias, size.limit = size.limit, cv.limit = cv.limit, add.arg)
output <- list(Estimates = outx, smallGroups = size_group, cvHigh = sd_bool,
stEDecrease = samp_eff, param = param)
class(output) <- c("surveysd", class(output))
return(output)
}
run.stError <- function(dat, period, var, weights, b.weights = paste0("w", 1:1000), fun, relative.share,
group, group.diff = NULL, period.diff = NULL, period.mean = NULL,
fun.adjust.var = NULL, adjust.var = NULL,
bias = FALSE, no.na, size.limit = 20, p = NULL, add.arg){
.SDx <- .SDw <- . <- .SD_add.args <- keep_row <- .SDy <- est_type <-
V1 <- N <- n <- difference_group_value <- help_direct_estimate <-
x <- size <- NULL
# melt data set for easier data manipulation if var holds multiple values
id.vars <- unique(c(period, unlist(group), unlist(add.arg), weights, b.weights))
id.vars <- which(colnames(dat) %in% id.vars)
m.vars <- paste(var,collapse="|")
m.vars <- paste0("^(",m.vars,")$")
m.vars <- list(m.vars) # used in patterns() for melting
name_var <- paste0("name_var_",generateRandomName(nchar = 10, existingNames = id.vars))
value_var <- paste0("value_var_",generateRandomName(nchar = 10, existingNames = id.vars))
value_adjust.var <- NULL
if(!is.null(adjust.var)){
value_adjust.var <- paste0("value_adjust.var_",generateRandomName(nchar = 10, existingNames = id.vars))
m.vars <- c(m.vars, list(paste0("^(",paste(adjust.var, collapse="|"),")$"))) # must be column index for melting
}
dat <- melt(dat,id.vars = id.vars, measure.vars = patterns(m.vars),
variable.name = c(name_var), value.name = c(value_var, value_adjust.var), variable.factor = FALSE)
if(!is.null(adjust.var)){
dat[, name_var := var[as.integer(name_var)], env = list(name_var = name_var)]
}
# apply fun.adjust.var
b.value_var <- NULL
if(!is.null(fun.adjust.var)){
b.value_var <- paste(value_var,b.weights,sep=".")
dat[,c(b.value_var) := mapply(fun.adjust.var, .SDx, .SDw, SIMPLIFY = FALSE), by=.(period,name_var),
env = list(period = period,
name_var = name_var,
.SDx = as.list(value_adjust.var),
.SDw = as.list(b.weights))]
}
b.value_var <- c(value_var,b.value_var)
# create point estimate for subnational result in relation to national level
if (relative.share) {
national_est <- paste("Nat",c(weights,b.weights), sep=".")
dat[,c(national_est) := mapply(fun, .SDx, .SDw, MoreArgs = .SD_add.args, SIMPLIFY = FALSE), by=c(period, name_var),
env = list(
.SDx = as.list(b.value_var),
.SDw = as.list(c(weights,b.weights)),
.SD_add.args = add.arg)]
}
# define parameter for quantile calculation
if (!is.null(p)) {
p.names <- paste0("p", p)
np <- length(p.names)
}
# define additional parameters for mean over consecutive periods
periods <- sort(unique(dat[[period]]))
if (!is.null(period.mean)) {
# formulate k consecutive periods
if (length(periods) >= period.mean & period.mean %% 2 == 1) {
periodsList <- unlist(lapply(
periods[1:c(length(periods) - period.mean + 1)],
function(z) {
if (!((z + period.mean - 1) > max(periods))) {
paste(z:c(z + period.mean - 1), collapse = "_")
}
}
))
} else{
periodsList <- NULL
warning("Not enough periods present in data to calculate mean over ",
period.mean, " periods.\n")
period.mean <- NULL
}
# get periods for k period mean over period-differences
# get for each difference a list of vectors that correspond to the
# differences needed to use for the mean over differences
if (!is.null(period.diff)) {
period.diff.b <- TRUE
period.diff.mean <- lapply(period.diff, function(z) {
z <- as.numeric(z)
steps <- (period.mean - 1) / 2
z_upper <- (z[1] + steps):(z[1] - steps)
z_lower <- (z[2] + steps):(z[2] - steps)
diff.feasable <- all(c(z_lower, z_upper) %in% periods)
if (diff.feasable) {
lapply(1:period.mean, function(s) {
c(z_upper[s], z_lower[s])
})
} else {
warning("Cannot calculate differences between periods ", z[1],
" and ", z[2], " over ", period.mean, " periods.\n")
NULL
}
})
period.diff.mean[unlist(lapply(period.diff.mean, is.null))] <- NULL
} else {
period.diff.b <- FALSE
period.diff.mean <- NULL
}
} else {
if (is.null(period.diff)) {
period.diff.b <- FALSE
} else {
period.diff.b <- TRUE
}
period.diff.mean <- NULL
periodsList <- NULL
}
# variable name for estimate
var_est <- generateRandomName(existingNames = c(unique(unlist(group)), "n" ,"N"))
name_weights <- paste0("name_weights_", generateRandomName(nchar = 10, existingNames = id.vars))
# iterate of groupings
out <- lapply(group, function(z) {
# use only unique values for grouping (duplicates are discarded)
# if period in z also discard -> always group by period
z <- unique(z[!z == period])
z <- as.character(z)
na.check <- z[z %in% no.na]
if(length(na.check)>0){
dat[, keep_row := rowSums(is.na(.SD))==0, .SDcols=c(na.check)]
}else{
dat[, keep_row := TRUE]
}
by.eval <- c(name_var, period, z)
# ---------------------------------
# calculate estimate
var_value <- paste0("var",1:length(c(weights, b.weights)))
if(relative.share){
var.est <- dat[keep_row == TRUE, c(.(n=.N, N = sum(weights)),
var=mapply(fun, .SDx, .SDw, MoreArgs = .SD_add.args, SIMPLIFY = FALSE),
head(.SD,1)),
by=c(by.eval),
env = list(weights = weights,
.SDx = as.list(b.value_var),
.SDw = as.list(c(weights, b.weights)),
.SD_add.args = add.arg),
.SDcols = c(national_est)]
var.est[,c(var_value) := mapply(`/`,.SDx,.SDy, SIMPLIFY = FALSE),
env = list(.SDx = as.list(var_value),
.SDy = as.list(national_est))]
}else{
var.est <- dat[keep_row == TRUE, c(.(n=.N, N = sum(weights)),
var=mapply(fun, .SDx, .SDw, MoreArgs = .SD_add.args, SIMPLIFY = FALSE)),
by=c(by.eval),
env = list(weights = weights,
.SDx = as.list(b.value_var),
.SDw = as.list(c(weights, b.weights)),
.SD_add.args = add.arg)]
}
var.est <- melt(var.est, id.vars = c(by.eval,"n","N"), measure.vars = var_value,
value.name = var_est, variable.name = name_weights)
var.est[,name_weights := factor(name_weights, labels = c(weights, b.weights)), env = list(name_weights = name_weights)]
var.est[, est_type := "norm"]
by.eval <- c(by.eval, name_weights)
# add groups which are not in var.est
# because they were not present in sample
if (!is.null(period.mean) | period.diff.b) {
# if group did not exist in period
# add group with NA
roll.miss <- unique(subset(dat[keep_row == TRUE],
select = c(name_var, period, z)))
roll.miss <- lapply(
roll.miss,
function(l) {
unique(na.omit(l))
}
)
roll.est <- data.table(expand.grid(roll.miss))
if (nrow(roll.est) > nrow(var.est)) {
var.est <- merge(roll.est, var.est, by = c(by.eval),
all.x = TRUE)
setkeyv(var.est, period)
var.est[is.na(V1), N := 0]
var.est[is.na(V1), n := 0]
}
}
# calculate average of estimate over 'period.mean'- periods
if (!is.null(periodsList)) {
roll.est <- var.est[, list(
var_est = rollMeanC(var_est, period.mean, type = "c"),
period = periodsList
), by = c(z, name_var, name_weights),
env = list(var_est = var_est,
period = period)]
roll.est[, est_type := "roll"]
roll.Nn <- var.est[name_weights == weights, list(
N = rollMeanC(x = N, k = period.mean, type = "c"),
n = rollMeanC(x = n, k = period.mean, type = "c"),
period = periodsList),
by = c(z, name_var), env = list(name_weights = name_weights,
period = period)]
# merge results
roll.est <- merge(roll.est, roll.Nn, by = c(z, period, name_var), all.x = TRUE)
}
# ----------------------------------
# differences between selected groups within years and across and years
# differences between periods
if (period.diff.b) {
# calcualte differences between periods
diff.est <- lapply(period.diff,function(y){
help_diff_group(var.est = var.est, by.eval = by.eval,
var_est = var_est, diff.group = y, diff.column = period)
})
diff.est <- rbindlist(diff.est)
diff.est[, est_type := "diff"]
# calcualte differences between periods and mean over consecutive
# differences
if (!is.null(unlist(period.diff.mean))) {
i.diff.mean <- (period.mean + 1) / 2
# i.diff.mean <- 1 # nolint
diff.mean.est <- lapply(period.diff.mean, function(d) {
# calculate differences for all pairwise periods in d
d.m.est <- lapply(d,function(y){
help_diff_group(var.est = var.est, by.eval = by.eval,
var_est = var_est, diff.group = y, diff.column = period)
})
# calcualte mean over consecutive differences
d.m.est <- rbindlist(d.m.est)
d.m.est <- d.m.est[,.(var_est = mean(var_est), n = mean(n), N=mean(N)),
by = c(z, name_var, name_weights), env = list(var_est = var_est)] #
d.m.est[, c(period) := paste0(paste(d[[i.diff.mean]], collapse = "-"),
"_mean")]
})
diff.mean.est <- rbindlist(diff.mean.est)
diff.mean.est[, est_type := "diff_mean"]
}
}
# difference between group within period
if(group.diff == TRUE & length(z)>0){
diff_group_col <- paste(z, collapse = "_")
z_combinations <- unique(subset(dat[keep_row == TRUE], select = z))
setorderv(z_combinations, z)
var.est[,difference_group_value := do.call(paste,c(.SD, sep="_")), .SDcols=c(z)]
z_combinations[,difference_group_value := do.call(paste,c(.SD, sep="_")), .SDcols=c(z)]
diff.group.est <- list()
for(i in 1:(nrow(z_combinations)-1)){
for(j in (i+1):nrow(z_combinations)){
# allow only differences between group if all but one values are the same
# eg. male age (20;30] - female age(40;50] is not supported
check_grouping <- z_combinations[i,.SD,.SDcols=c(z)] == z_combinations[j,.SD,.SDcols=c(z)]
if(sum(check_grouping==FALSE)!=1){
next # skip this combination
}
ij_difference <- z_combinations[c(i,j),lapply(.SD, function(z){paste(unique(z), collapse = " - ")}), .SDcols=c(z)]
y <- c(z_combinations[c(i,j),difference_group_value])
by.eval_help <- by.eval[!by.eval %in%z]
i.j.diff <- help_diff_group(var.est = var.est, by.eval = by.eval_help,
var_est = var_est, diff.group = y,
diff.column = "difference_group_value")
i.j.diff[,c(z) := ij_difference]
i.j.diff[,difference_group_value := NULL]
diff.group.est <- c(diff.group.est, list(i.j.diff))
}
}
diff.group.est <- rbindlist(diff.group.est)
diff.group.est[, est_type := "diff_group"]
var.est[,difference_group_value := NULL]
}
# add results to one data.table
if (!is.null(period.mean)) {
var.est <- rbind(var.est, roll.est, fill = TRUE, ignore.attr=TRUE)
}
if (period.diff.b) {
var.est <- rbind(var.est, diff.est, fill = TRUE, ignore.attr=TRUE)
if (!is.null(unlist(period.diff.mean))) {
var.est <- rbind(var.est, diff.mean.est, fill = TRUE, ignore.attr=TRUE)
}
}
if(group.diff == TRUE & length(z)>0){
var.est <- rbind(var.est, diff.group.est, fill = TRUE, ignore.attr=TRUE)
}
# ----------------------------------
# specify small groups
var.est[, help_direct_estimate := name_weights == weights, env = list(name_weights = name_weights)]
var.est.small <- var.est[help_direct_estimate == TRUE & N < size.limit]
if (nrow(var.est.small) > 0) {
small.group <- var.est.small[, .SD, .SDcols=c(period, z)]
cat(paste0("For grouping by ", paste(c(period, z), collapse = "~"),
": \n"))
if (nrow(var.est.small[N < size.limit]) > 10) {
cat(paste0("Sample size lower ", size.limit, " for ",
nrow(var.est.small[N < size.limit]), " groups \n"))
} else {
cat(paste0("Sample size lower ", size.limit, " for groups \n"))
print(small.group)
}
}
by.eval2 <- c(period, z)
if (!is.null(p)) {
sd.est <- var.est[help_direct_estimate == FALSE, as.list(
c(stE = sd(x), quantileNA(x, probs = p, p.names = p.names, np = np))),
by = c(by.eval2), env = list(x = var_est)]
} else {
sd.est <- var.est[help_direct_estimate == FALSE, .(stE = sd(x)), by = c(by.eval2),
env = list(x = var_est)]
}
out.z <- merge(var.est[help_direct_estimate == TRUE], sd.est, by = c(by.eval2))
if (bias) {
bias.est <- var.est[help_direct_estimate == FALSE, .(mean = mean(x)),
by = c(period, z), env = list(x = var_est)]
out.z <- merge(out.z, bias.est, by = c(period, z))
}
# define size groups - groups with zero size do not fall under size-output
# for groups with zero size the resutling estimatese will be NAs
out.z[(!is.na(n)) & n > 0, size := n < size.limit]
return(out.z)
})
out <- rbindlist(out, fill = TRUE, use.names = TRUE)
setnames(out, c(var_est, name_var), c("val","est"))
out[,c(name_weights,"help_direct_estimate"):=NULL]
return(out)
}
help_diff_group <- function(var.est, by.eval, var_est, diff.group, diff.column){
. <- N <- n <- NULL
by.eval <- by.eval[!by.eval %in% diff.column]
var.est.diff <- var.est[diff.column %in% diff.group, env = list(diff.column = diff.column)]
var.est.diff[,diff.column := factor(diff.column, levels = diff.group), env = list(diff.column = diff.column)]
setorderv(var.est.diff,diff.column,order = c(-1)) # diff() ~ x_1-x_0
var.est.diff <- var.est.diff[,.(N=mean(N), n=mean(n), var_est = diff(var_est)),
by = c(by.eval), env = list(var_est = var_est)]
var.est.diff[, c(diff.column) := paste(diff.group, collapse = "-")]
var.est.diff <- subset(var.est.diff, select = c(diff.column, by.eval, "n", "N", var_est))
return(var.est.diff)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.