harvest: Generate iterative raking weights for your data.

View source: R/harvest.R

harvest    R Documentation

Generate iterative raking weights for your data.

Description

This function implements a variation of iterative raking, as described in DeBell and Krosnick (2009). It replaces the anesrake function in the anesrake package and adds support for modern data types, a tidy workflow, additional user control, and faster estimation. harvest() is designed so that, for most users, the default two-argument function call harvest(data, target) will behave well, but almost every element of the process can be customized by users who want additional control.

Usage

harvest(data, target, start_weights = 1, max_weight = 5,
  max_iterations = 1000, select_params = c(pct = 0.05, count = 5),
  convergence = c(pct = 0.01, absolute = 1e-06, time = NULL, single_weight =
  NULL), select_function = "pct", error_function = "linear",
  verbose = FALSE, attach_weights = TRUE, weight_column = NULL,
  add_na_proportion = FALSE, target_map = NULL, enforce_mean = TRUE, ...)

Arguments

data

A data frame (tibble) or matrix containing data to be raked. The data can contain columns not used in the raking, but must contain all the columns used in the raking.

target

A list of target proportions in the population of interest. This argument can be in one of two formats: a list of named numeric vectors, or a data frame (tibble). If a data frame, see the target_map argument and the "Specifying target" section of this documentation for more details on the data frame's format. No level may have a negative proportion or an NA, and the proportions for each variable should sum to 1.

start_weights

Starting weights. This may either be a single positive number (which will be implicitly renormed to 1), or a vector of length n, where n is the number of rows in the data. No values in this vector may be NA, but some can be 0. Lovelace et al. (2015) found that initial weights generally have very little impact on final weight estimations. Selecting better initial weights may speed up convergence.

max_weight

A maximum value to clamp weights to. By default, as per DeBell and Krosnick (2009) and anesrake, this is set to 5. Note: When weights exceed max_weight, those weights are truncated to max_weight and all weights are then rescaled to have mean 1. This means capped weights may sometimes exceed max_weight in order to preserve a mean weight of 1. To override this, see the documentation for enforce_mean.

max_iterations

A maximum number of iterations per raking attempt. The default is 1,000. Note that the total number of iterations may exceed this number if, after raking, additional variables display imbalance. Defaults in anesrake and Ipfp are 1,000. Default in rake is 10, but with considerably looser convergence criteria.

select_params

A named vector of variable selection parameters. Which names to supply depends on the variable selection function. Parameters for built-in variable selection functions are described below.

convergence

A named vector of convergence parameters. These are described below but the defaults are well-tuned for both speed and convergence to population marginals.

select_function

Specification of selection function (how we choose which variables to rake on). This can either be a character vector specifying a built-in function, or a function closure (unquoted name of a function) which implements a custom selection method. The built-in options are "pct" (default), "all", "number", "lesser", or "greater". You can read more about these options below. Discussion of custom selection functions is also found below.

error_function

Specification of error function (how we measure how far off a variable is from its intended result). This can be either a character vector specifying a built-in function, or a function closure (unquoted name of a function) which calculates a custom error rate. The built-in options are "linear" (default), "max", "squared", "mean", "maxsquared", and "meansquared". You can read more about these options below. "total", "average", "totalsquared", and "averagesquared" are also accepted for backwards compatibility with anesrake. Discussion of custom error functions is also found below.

verbose

Level of verbosity, defaults to FALSE. At TRUE or 1, the function begins emitting progress information. At 2, each iteration provides a significant amount of progress information.

attach_weights

A logical value, default TRUE. If FALSE, this function will return weights as a vector. If TRUE, this function will attach weights to the data frame provided and return the new data frame. The weights will be added in a column named "weights". If a column named "weights" is already present, backup names will be used and the user will receive feedback.

weight_column

A quoted character vector specifying a name for the column attached if attach_weights is TRUE. If a column with this name already exists, it will be overwritten.

add_na_proportion

If TRUE, harvest will adjust the target proportions so that each variable has a proportion for missing data reflecting the missing data observed in the data sample. If a character vector, harvest will adjust the variables in the character vector but no others. If a numeric vector, harvest will adjust the variables based on num

target_map

Used only if target is a data frame, provides a mapping from columns of the data frame to variable, level, and proportion. Should be specified as a named vector, e.g. target_map = c("variable" = 1, "level" = 2, "proportion" = 3). The values of this vector can be either numeric column indexes or quoted character vectors of column names.

enforce_mean

Default TRUE. By default, weights minimize divergence from target proportions subject to two conditions: that the mean weight be 1, and that the maximum weight be capped at max_weight. Weights are first capped and then re-meaned. When enforce_mean is FALSE, the re-meaning does not occur. This will guarantee the maximum weight does not exceed max_weight but may result in mean weights diverging from 1. Just as max_weight prevents high-weight observations from becoming higher weighted, enforce_mean helps prevent low-weight observations from becoming even lower weighted.

...

Additional arguments to this function are ignored.
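
The interaction between max_weight and enforce_mean described above can be illustrated with a small base R sketch. This is not the package's internal code; cap_weights is a hypothetical helper that only mimics the documented "cap, then re-mean" order, and the example weights are made up.

```r
# Hypothetical helper (not part of the package): cap weights, then optionally
# rescale them so that their mean is 1.
cap_weights <- function(w, max_weight = 5, enforce_mean = TRUE) {
  w <- pmin(w, max_weight)   # truncate anything above the cap
  if (enforce_mean) {
    w <- w / mean(w)         # re-mean; this can push capped weights above the cap
  }
  w
}

# Made-up weights: five very small weights and one very large one.
w <- c(rep(0.1, 5), 9.5)

capped        <- cap_weights(w)                        # mean 1, max may exceed 5
capped_strict <- cap_weights(w, enforce_mean = FALSE)  # max is 5, mean may differ from 1

print(round(capped, 3))
print(round(capped_strict, 3))
```

With these inputs the re-meaned weights have mean 1 but the largest weight ends up above max_weight, while the strict version respects the cap at the cost of a mean below 1.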

Details

The default parameterization of the function works very well. There should be little need to select the alternate calculation methods or tweak any of the parameters. To the extent that weights do not converge, this is likely to be a pathology of the data or the target proportions rather than the parameterization of iterative raking. Documentation below is primarily intended for advanced users who want to customize parameters in detail.

Value

The original data frame data augmented with a new column containing the calculated weights if attach_weights is TRUE (default). If attach_weights is FALSE, a vector of numeric weights in the order of the supplied cases.

Convergence Parameters

By default, harvest() will return a warning if results do not converge or only partial convergence is achieved. This normally occurs if the rate of convergence slows before weights have stabilized. If this occurs, users can choose between altering parameters to force better convergence, or simply evaluating divergence from population marginals. By default, partial convergence messages appear if the sum of absolute differences in unit weights is changing by more than 1e-3 when a breaking rule is triggered. The degree to which this constitutes a meaningful divergence from target proportions is case dependent.

The parameter convergence is a named vector containing up to four values:

"pct"

A threshold governing convergence of the iterative raking process. Each iteration will typically adjust weights by a smaller amount than the previous iteration. If the adjustment is greater than (1 - pct) of the previous adjustment, convergence is achieved. In other words, as the magnitude of weight updating becomes stable across iterations, convergence is achieved. Default 0.01. anesrake uses 0.01. Ipfp and rake do not support this parameter. Values less than 0 or greater than 1 are not permitted.

"absolute"

If the absolute overall weight adjustment between iterations is less than this parameter, convergence is achieved. Default 1e-6. anesrake does not support this parameter. Ipfp uses 1e-10. ipfp uses a machine-dependent parameter approximately equal to 1e-8 on most machines.

"time"

Optional. If provided, runs the iterative raking algorithm for at most convergence["time"] seconds. In general, exiting after a fixed time period will have a negative impact on convergence. If provided, must be at least 0 and can be fractional (e.g. 0.5 will run the algorithm for half a second).

"single_weight"

If the maximum single weight adjustment between iterations is less than this parameter, convergence is achieved. Must be non-negative.

Users interested in achieving better convergence to target proportions at the cost of time should set convergence["pct"] and convergence["absolute"] to low values (say, 0.0001 and 1e-8), and raise max_iterations as high as is practical (say, 10,000). Users interested in quick but imperfect results should use convergence["time"] to cap runtime.
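
As a rough illustration of how the "pct", "absolute", and "single_weight" rules above combine, the following base R sketch checks them for one iteration. It is an assumption-laden reading of the text, not the package's implementation: has_converged and its arguments (the total adjustment in the previous and current iteration, and the largest single-weight change) are hypothetical, and the "time" rule is omitted.

```r
# Hypothetical helper (not part of the package): apply the documented stopping
# rules to the adjustments observed in two consecutive iterations.
has_converged <- function(prev_adj, curr_adj, max_single_adj,
                          convergence = c(pct = 0.01, absolute = 1e-06)) {
  conv <- as.list(convergence)
  # "pct": the current adjustment is greater than (1 - pct) of the previous one,
  # i.e. the size of the updates has stabilized.
  pct_rule <- !is.null(conv[["pct"]]) &&
    curr_adj > (1 - conv[["pct"]]) * prev_adj
  # "absolute": the overall adjustment is below the threshold.
  abs_rule <- !is.null(conv[["absolute"]]) &&
    curr_adj < conv[["absolute"]]
  # "single_weight": the largest change to any one weight is below the threshold.
  single_rule <- !is.null(conv[["single_weight"]]) &&
    max_single_adj < conv[["single_weight"]]
  pct_rule || abs_rule || single_rule
}

still_moving  <- has_converged(1.0, 0.5, 0.1)      # adjustments still shrinking quickly
stabilized    <- has_converged(1.0, 0.995, 0.1)    # "pct" rule fires
tiny_change   <- has_converged(1e-5, 5e-7, 1e-7)   # "absolute" rule fires
single_weight <- has_converged(1.0, 0.5, 0.001,
                               c(pct = 0.01, absolute = 1e-06, single_weight = 0.01))
```

Tightening pct and absolute makes each rule harder to satisfy, which is why stricter settings generally need a larger max_iterations.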

Variable Selection Functions and Parameters

Built-in selection functions include:

"pct"

Rake on variables whose initial error is greater than select_params[["pct"]]. This is the default variable selection function, and the default select_params[["pct"]] is 0.05. The units of select_params[["pct"]] depend on the error function selected, but for the default "linear" error function 0.05 means that a variable is left unraked only if its levels collectively deviate from their targets by no more than 5 percentage points.

"all"

Rake on all variables.

"number"

Rake on exactly select_params[["count"]] variables. The default is 5. Variables are selected in descending order of error.

"lesser"

Rake on the smaller of the two sets of variables selected by the select_params[["pct"]] and select_params[["count"]] arguments.

"greater"

Rake on the larger of the two sets of variables selected by the select_params[["pct"]] and select_params[["count"]] arguments.

"pctlim"

Same as "pct", backwards compatibility for anesrake

"nlim"

Same as "number", backwards compatibility for anesrake

"nolim"

Same as "all", backwards compatibility for anesrake

"nmin"

Same as "greater", backwards compatibility for anesrake. Please note that "nmin" is equivalent to "greater", not "lesser".

"nmax"

Same as "lesser", backwards compatibility for anesrake. Please note that "nmax" is equivalent to "lesser", not "greater".

Built-in select_params parameters are:

"pct"

Percentage threshold used if the selection function is "pct", "lesser", or "greater". The scale of this threshold is total absolute percentage deviation if the linear error function is used, so pct = 0.05 implies a total deviation from target proportions of no less than 5 percentage points. For other error functions, unit scales may differ.

"count"

Number of variables to select if the selection function is "number", "lesser", or "greater". Variables are selected in descending order of error.
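
The selection rules and parameters above can be sketched in base R as follows. This is a hypothetical re-implementation for illustration only; select_vars is not a package function, the per-variable errors are made up, and how harvest() breaks ties or handles edge cases is not specified here.

```r
# Hypothetical re-implementation of the documented selection rules.
# `errors` is a named numeric vector of per-variable errors.
select_vars <- function(errors, rule = "pct",
                        params = c(pct = 0.05, count = 5)) {
  ranked   <- sort(errors, decreasing = TRUE)   # worst variables first
  by_pct   <- ranked[ranked > params[["pct"]]]  # variables above the threshold
  by_count <- head(ranked, params[["count"]])   # the `count` worst variables
  switch(rule,
    pct     = by_pct,
    all     = ranked,
    number  = by_count,
    lesser  = if (length(by_pct) <= length(by_count)) by_pct else by_count,
    greater = if (length(by_pct) >= length(by_count)) by_pct else by_count,
    stop("unknown rule: ", rule)
  )
}

# Made-up errors for four raking variables.
errors <- c(age = 0.12, gender = 0.02, region = 0.07, education = 0.30)
params <- c(pct = 0.05, count = 2)

picked_pct     <- names(select_vars(errors, "pct", params))
picked_number  <- names(select_vars(errors, "number", params))
picked_lesser  <- names(select_vars(errors, "lesser", params))
picked_greater <- names(select_vars(errors, "greater", params))
```

Here three variables exceed the 0.05 threshold but count is 2, so "lesser" returns the two-variable set and "greater" returns the three-variable set.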

Error Functions

Custom selection functions should take two arguments: a named numeric vector which supplies the available variables and their calculated errors, and a named vector of parameters. Custom selection functions should return a non-empty subset of the named numeric vector.
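
A custom selection function with the two-argument shape described above might look like the following sketch. The function name above_or_worst and the example errors are hypothetical; the only assumptions taken from this documentation are the argument order (errors, then parameters) and the requirement to return a non-empty subset.

```r
# Hypothetical custom selection function: keep every variable whose error
# exceeds the "pct" parameter, but fall back to the single worst variable so
# that the result is never empty.
above_or_worst <- function(errors, params) {
  picked <- errors[errors > params[["pct"]]]
  if (length(picked) == 0) {
    picked <- errors[which.max(errors)]
  }
  picked
}

params <- c(pct = 0.05, count = 5)
some_above <- above_or_worst(c(age = 0.10, gender = 0.02), params)
none_above <- above_or_worst(c(age = 0.01, gender = 0.02), params)

# Per the select_function argument, it would be passed unquoted, e.g.:
# harvest(respondent_data, ns_target, select_function = above_or_worst)
```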

Built-in error functions include:

"linear"

Sum of absolute differences

"squared"

Sum of squared differences

"max"

Maximum absolute difference

"mean"

Mean absolute difference

"maxsquared"

Maximum squared difference

"meansquared"

Mean squared difference

"total"

Same as "linear", backwards compatibility for anesrake

"average"

Same as "mean", backwards compatibility for anesrake

"totalsquared"

Same as "squared", backwards compatibility for anesrake

"averagesquared"

Same as "meansquared", backwards compatibility for anesrake

Custom error functions should take two arguments: a numeric vector containing the target proportions, and a numeric vector containing the current weighted performance. They should return a single numeric summary of the data.
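
For reference, the built-in summaries listed above correspond to the following one-line base R expressions, followed by a custom error function of the same two-argument shape. The list error_fns, the function rmse_error, and the example proportions are hypothetical and only restate the definitions; they are not the package's code.

```r
# Hypothetical versions of the documented error summaries. Each takes the
# target proportions and the current weighted proportions for one variable.
error_fns <- list(
  linear      = function(target, current) sum(abs(target - current)),
  squared     = function(target, current) sum((target - current)^2),
  max         = function(target, current) max(abs(target - current)),
  mean        = function(target, current) mean(abs(target - current)),
  maxsquared  = function(target, current) max((target - current)^2),
  meansquared = function(target, current) mean((target - current)^2)
)

# A custom error function: root mean squared difference.
rmse_error <- function(target, current) sqrt(mean((target - current)^2))

# Made-up proportions for a three-level variable.
target  <- c(a = 0.50, b = 0.30, c = 0.20)
current <- c(a = 0.60, b = 0.25, c = 0.15)

all_errors <- sapply(error_fns, function(f) f(target, current))
custom     <- rmse_error(target, current)

# Per the error_function argument, it would be passed unquoted, e.g.:
# harvest(respondent_data, ns_target, error_function = rmse_error)
```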

Interpreting NA Values in Data

If data contains an NA in raking variables, harvest() will ignore those observations when raking on the variables where they are NA. This effectively means that when raking an age variable, respondents with missing age are assumed to be correctly proportioned by age. In addition, calculations of weighted marginals (for instance, for error) ignore NA respondents.
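
A weighted marginal that ignores NA respondents, in the sense described above, can be sketched in base R as follows. The helper weighted_marginal and the example data are hypothetical and are not taken from the package.

```r
# Hypothetical helper: weighted proportions of each level of x, computed only
# over observations where x is not missing.
weighted_marginal <- function(x, w) {
  keep   <- !is.na(x)
  totals <- vapply(split(w[keep], x[keep]), sum, numeric(1))
  totals / sum(totals)
}

# Made-up data: the third respondent has a missing value and a large weight,
# but does not affect the marginal.
x <- c("a", "b", NA, "a")
w <- c(1, 2, 5, 1)
marginal <- weighted_marginal(x, w)
```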

An alternative strategy supported by harvest() is for the user to specify add_na_proportion, an argument which will interpret missing data as a "decline to state" response category, and also alter target proportions to add such a category. The proportion missing in the data is assumed to be the population "decline to state" proportion. Other population proportions are adjusted accordingly, as if decline to state is distributed randomly with respect to other such values. Documentation for this argument is included above.
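
The target adjustment described in the previous paragraph can be sketched as follows: the observed share of missing values becomes the target for a new category, and the remaining targets shrink proportionally. This is an illustration under assumptions; add_na_level is a hypothetical helper, the level label "NA" is chosen here for readability, and the label harvest() actually uses is not stated in this documentation.

```r
# Hypothetical helper: add a missing-data level to one variable's targets.
add_na_level <- function(target, x) {
  p_na <- mean(is.na(x))                # share of missing values in the sample
  if (p_na == 0) return(target)         # nothing to adjust
  c(target * (1 - p_na), "NA" = p_na)   # shrink other levels proportionally
}

# Made-up targets and sample: 2 of 10 respondents did not report age.
age_target <- c("18-34" = 0.3, "35-64" = 0.5, "65+" = 0.2)
age <- c("18-34", "35-64", NA, "65+", "35-64",
         "35-64", NA, "18-34", "65+", "35-64")

adjusted <- add_na_level(age_target, age)
```

The adjusted targets still sum to 1, with 20 percent assigned to the missing-data category in this example.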

In cases where systematic nonresponse to a question is a problem, users might try external packages capable of imputing missing data, or else alter target proportions and data to remove missingness.

Specifying target

Each target proportion combines three pieces of information: a variable, a level of that variable, and an associated proportion. target can be specified in one of two ways: as a list of named vectors, or as a data frame.

If a list of named vectors, the list names are variable names, the vector names are variable levels, and the vector values are proportions. This is the specification used in anesrake's anesrake function, and also the manner in which the built-in ns_target dataset is specified.
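
A target in the list-of-named-vectors form might look like the sketch below. The variable names, levels, and proportions are made up for illustration, and the check at the end simply restates the requirements given for the target argument (no NA, no negative values, each variable summing to 1).

```r
# Hypothetical targets: list names are variables, vector names are levels,
# vector values are population proportions.
my_target <- list(
  gender = c(female = 0.52, male = 0.48),
  region = c(north = 0.25, south = 0.35, east = 0.20, west = 0.20)
)

# Basic validity checks matching the documented requirements.
valid <- vapply(my_target, function(p) {
  !anyNA(p) && all(p >= 0) && isTRUE(all.equal(sum(p), 1))
}, logical(1))
```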

If a data frame, harvest() attempts to match variables in the data frame to the three pieces of information above. Matching occurs in the following order:

  1. If the user provides a target_map argument, then target_map should be a named vector whose names are "variable", "level", and "proportion" and whose values are the numeric indices or column names of the respective data in the target argument.

  2. If columns named "variable", "level", and "proportion" exist in the target data frame, then these will be used.

  3. If the target data frame is exactly three columns, then the first column is assumed to contain "variable", the second to contain "level", and the third to contain "proportion". A warning will be generated if the user provided some but not all of target_map.

  4. If none of these conditions hold, an error will be produced.
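
To make the data frame format concrete, the sketch below builds a long-format target with non-standard column names, a matching target_map, and a conversion to the list-of-named-vectors form. The column names, the values, and the helper df_to_list are hypothetical; the conversion shows how the two formats relate and is not the package's internal logic.

```r
# Hypothetical long-format targets with non-standard column names.
target_df <- data.frame(
  var   = c("gender", "gender", "region", "region"),
  value = c("female", "male", "north", "south"),
  share = c(0.52, 0.48, 0.40, 0.60),
  stringsAsFactors = FALSE
)

# Names are fixed by the documentation; values point at columns of target_df.
target_map <- c("variable" = "var", "level" = "value", "proportion" = "share")

# Hypothetical helper: convert the data frame into a list of named vectors.
df_to_list <- function(df, map) {
  parts <- split(df, df[[map[["variable"]]]])
  lapply(parts, function(d) {
    setNames(d[[map[["proportion"]]]], d[[map[["level"]]]])
  })
}

as_list <- df_to_list(target_df, target_map)

# The data frame form would be passed together with the map, e.g.:
# harvest(respondent_data, target_df, target_map = target_map)
```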

Naming Weight Columns

If weights are attached to a data frame, the weights will be called "weights" by default. If such a column already exists, the column will be called ".weights.autumn". If this column already exists, ".weights_autumn1" through ".weights_autumn10" will be used. If all of these columns exist, harvest() will return an error. To customize the column name, use the argument weight_column described above.
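
The fallback sequence above amounts to picking the first unused name from an ordered list of candidates. The sketch below illustrates that logic; pick_weight_column is a hypothetical helper, and the candidate names are copied as written in this section rather than checked against the package source.

```r
# Hypothetical helper: choose the weight column name given the existing
# column names and an optional user-supplied name.
pick_weight_column <- function(existing, weight_column = NULL) {
  if (!is.null(weight_column)) return(weight_column)  # user-chosen name wins
  candidates <- c("weights", ".weights.autumn",
                  paste0(".weights_autumn", 1:10))
  free <- setdiff(candidates, existing)               # keeps candidate order
  if (length(free) == 0) stop("no free weight column name available")
  free[1]
}

first_choice  <- pick_weight_column(c("id", "age"))
second_choice <- pick_weight_column(c("id", "weights"))
third_choice  <- pick_weight_column(c("weights", ".weights.autumn"))
user_choice   <- pick_weight_column(c("weights"), weight_column = "w")
```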

Examples

## Not run: 
# Simple call
harvest(respondent_data, ns_target)

# Pipe workflow
respondent_data %>% harvest(ns_target)

# Return weights as vector instead of attaching to data frame
harvest(respondent_data, ns_target, attach_weights = FALSE)

# Modified convergence criteria to be more permissive
harvest(respondent_data, ns_target,
        convergence = c(
          pct = 0.05, absolute = 1e-4
        ))

# Limit runtime to 3 seconds:
harvest(respondent_data, ns_target,
        convergence = c(
          pct = 0.01, absolute = 1e-6,
          time = 3
        ))

# Alternate error function or variable selection function:
harvest(respondent_data, ns_target,
        error_function = "meansquared",
        select_function = "number")

# Generate an annoying amount of diagnostic information
harvest(respondent_data, ns_target, verbose = 2)

## End(Not run)

aaronrudkin/autumn documentation built on Feb. 5, 2024, 6:08 p.m.