harvest | R Documentation |
This function implements a variation of iterative raking, as described in
DeBell and Krosnick (2009). It replaces the anesrake
function in the anesrake package and adds support for modern data types, a
tidy workflow, additional user control, and faster estimation.
harvest() is designed so that for most users, the default two-argument
function call harvest(data, target) will behave well, but almost every
element of the process can be customized by users who want additional
control.
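For readers new to iterative raking, the core idea can be sketched in a few lines of base R. This is a minimal illustration only, not the package's implementation: the toy data, the helper names (weighted_props, rake_sketch), and the stopping tolerance are all assumptions made for the example, and weight clamping and variable selection are left out.

```r
# Hypothetical toy data with two categorical raking variables.
toy <- data.frame(
  gender = c("f", "f", "f", "m", "m", "f", "f", "m"),
  age    = c("young", "old", "young", "old", "young", "young", "old", "old"),
  stringsAsFactors = FALSE
)

# Target proportions as a list of named numeric vectors.
toy_target <- list(
  gender = c(f = 0.5, m = 0.5),
  age    = c(young = 0.4, old = 0.6)
)

# Weighted share of each level of one variable.
weighted_props <- function(x, w) {
  vapply(split(w, x), sum, numeric(1)) / sum(w)
}

rake_sketch <- function(data, target, max_iterations = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (i in seq_len(max_iterations)) {
    w_old <- w
    for (v in names(target)) {
      current <- weighted_props(data[[v]], w)
      # Scale each row by (target share / current share) of its level.
      ratio <- target[[v]][names(current)] / current
      w <- w * ratio[data[[v]]]
    }
    w <- w / mean(w)                        # keep the mean weight at 1
    if (sum(abs(w - w_old)) < tol) break    # weights have stopped moving
  }
  unname(w)
}

w <- rake_sketch(toy, toy_target)
round(weighted_props(toy$gender, w), 3)
round(weighted_props(toy$age, w), 3)
```

After the loop, the weighted marginals of both variables should sit very close to the targets while the mean weight stays at 1.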
harvest(data, target, start_weights = 1, max_weight = 5,
max_iterations = 1000, select_params = c(pct = 0.05, count = 5),
convergence = c(pct = 0.01, absolute = 1e-06, time = NULL, single_weight =
NULL), select_function = "pct", error_function = "linear",
verbose = FALSE, attach_weights = TRUE, weight_column = NULL,
add_na_proportion = FALSE, target_map = NULL, enforce_mean = TRUE, ...)
data |
A data frame (tibble) or matrix containing data to be raked. The data can contain columns not used in the raking, but must contain all the columns used in the raking |
target |
A list of target proportions in the population of interest.
This argument can be one of two formats: a list of named numeric vectors,
or a data frame (tibble). If a data frame, see the |
start_weights |
Starting weights. This may either be a single positive number (which will be implicitly renormed to 1), or a vector of length n, where n is the number of rows in the data. No values in this vector may be NA, but some can be 0. Lovelace et al. (2015) found that initial weights generally have very little impact on final weight estimations. Selecting better initial weights may speed up convergence. |
max_weight |
A maximum value to clamp weights to. By default, as per
DeBell and Krosnick (2009) and |
max_iterations |
A maximum number of iterations per raking attempt.
The default is 1,000. Note that the total number of iterations may exceed
this number if after raking, additional variables display imbalance.
Defaults in |
select_params |
A named vector of variable selection parameters. Which names to supply depends on the variable selection function. Parameters for built-in variable selection functions are described below. |
convergence |
A named vector of convergence parameters. These are described below but the defaults are well-tuned for both speed and convergence to population marginals. |
select_function |
Specification of the variable selection function (how we choose which variables to rake on). This can either be a character vector specifying a built-in function, or a function closure (unquoted name of a function) which implements a custom selection method. The built-in options are "threshold" (default), "all", "number", "lesser", or "greater". You can read more about these options below. Discussion of custom selection functions is also found below. |
error_function |
Specification of error function (how we measure how
far off a variable is from its intended result). This can be either a
character vector specifying a built-in function, or a function closure (
unquoted name of a function) which calculates a custom error rate. The
built-in options are "linear" (default), "max", "squared", "mean",
"maxsquared", and "meansquared". You can read more about these options
below. "total", "average", "totalsquared", and "averagesquared" are also
accepted for backwards compatibility with
|
verbose |
Level of verbosity, defaults to FALSE. At TRUE or 1, the function begins emitting progress information. At 2, each iteration provides a significant amount of progress information. |
attach_weights |
A binary value, default TRUE. If FALSE, this function will return weights as a vector. If TRUE, this function will attach weights to the data frame provided and return the new data frame. The weights will be added in a column named "weights". If a column named "weights" is present, backup options will be used and the user will receive feedback. |
weight_column |
A quoted character vector specifying a name for the
column attached if |
add_na_proportion |
If TRUE, |
target_map |
Used only if |
enforce_mean |
Default TRUE. By default, weights minimize divergence
from target proportions subject to two conditions: that the mean weight
be 1, and that the maximum weight be capped at |
... |
Additional arguments to this function are ignored |
The default parameterization of the function works very well. There should be little need to select the alternate calculation methods or tweak any of the parameters. To the extent that weights do not converge, this is likely to be a pathology of the data or the target proportions rather than the parameterization of iterative raking. Documentation below is primarily intended for advanced users who want to customize parameters in detail.
The original data frame data, augmented with a new column containing the
calculated weights, if attach_weights is TRUE (default). If attach_weights
is FALSE, a vector of numeric weights in the order of the supplied cases.
By default, harvest() will return a warning if results do not converge or
only partial convergence is achieved. This normally occurs if the rate of
convergence slows before weights have stabilized. If this occurs, users can
choose between altering parameters to force better convergence, or simply
evaluating divergence from population marginals. By default, partial
convergence messages appear if the sum of absolute differences in unit
weights is changing by more than 1e-3 when a breaking rule is triggered.
The degree to which this constitutes a meaningful divergence from target
proportions is case dependent.
The parameter convergence is a named vector that can contain the following
values:
pct |
A threshold governing convergence of the iterative raking process. Each
iteration will typically adjust weights by a smaller amount than the
previous iteration. If the adjustment is greater than (1 - pct) of the
previous adjustment, convergence is achieved. In other words, as the
magnitude of weight updating becomes stable across iterations, convergence
is achieved. Default 0.01. anesrake uses 0.01. Ipfp and rake do not support
this parameter. Values less than 0 or greater than 1 are not permitted.
absolute |
If the absolute overall weight adjustment between iterations is less than
this parameter, convergence is achieved. Default 1e-6. anesrake does not
support this parameter. Ipfp uses 1e-10. ipfp uses a machine-dependent
parameter approximately equal to 1e-8 on most machines.
time |
Optional. If provided, runs the iterative raking algorithm for at most
convergence["time"] seconds. In general, exiting after a fixed time period
will have a negative impact on convergence. If provided, must be at least 0
and can be fractional (e.g. 0.5 will run the algorithm for half a second).
single_weight |
If the maximum single weight adjustment between iterations is less than this parameter, convergence is achieved. Must be non-negative.
Users interested in achieving better convergence to target proportions at
the cost of time should set convergence["pct"] and convergence["absolute"]
to low values (say, 0.0001 and 1e-8), and raise max_iterations as high as
possible (say, 10,000). Users interested in quick but imperfect results
should use convergence["time"] to cap runtime.
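The stopping rules described above can be sketched as a single check run after each iteration. This is an illustrative reading of the documentation rather than the package's code: the helper name check_convergence and its arguments are hypothetical, and measuring the "overall adjustment" as the sum of absolute weight changes is an assumption.

```r
# Hypothetical helper: report which stopping rules fire, given the weights
# from the two most recent iterations and the size of the previous adjustment.
check_convergence <- function(w_new, w_old, prev_adjust,
                              pct = 0.01, absolute = 1e-6,
                              single_weight = NULL,
                              time = NULL, started = Sys.time()) {
  # Assumption: overall adjustment = sum of absolute weight changes.
  adjust <- sum(abs(w_new - w_old))
  c(
    # Adjustment is no longer shrinking meaningfully between iterations.
    pct = is.finite(prev_adjust) && adjust > (1 - pct) * prev_adjust,
    # Overall adjustment is tiny in absolute terms.
    absolute = adjust < absolute,
    # Largest change to any single weight is below the cutoff (if supplied).
    single_weight = !is.null(single_weight) &&
      max(abs(w_new - w_old)) < single_weight,
    # Time budget in seconds is used up (if supplied).
    time = !is.null(time) &&
      as.numeric(difftime(Sys.time(), started, units = "secs")) >= time
  )
}

w_old <- c(1.00, 1.00, 1.00)
w_new <- c(1.10, 0.95, 0.95)

still_moving <- check_convergence(w_new, w_old, prev_adjust = 0.5)
stalled      <- check_convergence(w_new, w_old, prev_adjust = 0.2005)
small_step   <- check_convergence(w_new, w_old, prev_adjust = 0.5,
                                  single_weight = 0.2)
```

In this toy case no rule fires while the adjustment is still shrinking quickly, the pct rule fires once the adjustment is nearly the same size as the previous one, and the single_weight rule fires when a cutoff larger than the biggest individual change is supplied.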
Built-in selection functions include:
"threshold" |
Rake on variables whose initial error is greater than
select_params[["pct"]]. This is the default variable selection function,
and the default select_params[["pct"]] is 0.05. The units of
select_params[["pct"]] depend on the error function selected, but for the
default "linear" error function, 0.05 corresponds to a 5% total deviation
from target proportions (across a variable's levels collectively).
"all" |
Rake on all variables.
"number" |
Rake on exactly select_params[["count"]] variables. The default is 5.
Variables are selected in descending order of error.
"lesser" |
Rake on the smaller of the two sets of variables selected by the
select_params[["pct"]] and select_params[["count"]] arguments.
"greater" |
Rake on the larger of the two sets of variables selected by the
select_params[["pct"]] and select_params[["count"]] arguments.
"pctlim" |
Same as "pct", backwards compatibility for anesrake.
"nlim" |
Same as "number", backwards compatibility for anesrake.
"nolim" |
Same as "all", backwards compatibility for anesrake.
"nmin" |
Same as "greater", backwards compatibility for anesrake. Please note that "nmin" is equivalent to "greater", not "lesser".
"nmax" |
Same as "lesser", backwards compatibility for anesrake. Please note that "nmax" is equivalent to "lesser", not "greater".
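One way to read the built-in selection rules above is sketched below in base R. This is an interpretation for illustration, not the package's code; the error values and object names are made up, and the sketch assumes "lesser" and "greater" compare the sizes of the threshold-based and count-based sets.

```r
# Hypothetical per-variable errors and selection parameters.
errors <- c(age = 0.12, gender = 0.02, region = 0.07, education = 0.01)
params <- c(pct = 0.05, count = 1)

ranked <- sort(errors, decreasing = TRUE)

# "threshold": variables whose error exceeds pct.
by_threshold <- ranked[ranked > params[["pct"]]]

# "number": the count variables with the largest error.
by_number <- head(ranked, params[["count"]])

# "lesser" and "greater": the smaller or the larger of the two sets.
sets    <- list(by_threshold, by_number)
lesser  <- sets[[which.min(lengths(sets))]]
greater <- sets[[which.max(lengths(sets))]]

names(lesser)
names(greater)
```

With these made-up numbers the threshold rule picks two variables and the count rule picks one, so "lesser" rakes on one variable and "greater" on two.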
Built-in select_params parameters are:
pct |
Percentage threshold used if the selection function is "threshold",
"lesser", or "greater". The scale of this threshold is total absolute
percentage deviation if the linear error function is used, so pct = 0.05
implies a total deviation from target proportions of no less than 5%. For
other error functions, unit scales may differ.
count |
Number of variables to select if the selection function is "number", "lesser", or "greater". Variables are selected in descending order of error.
Custom selection functions should take two arguments: a named numeric vector which supplies the available variables and their calculated errors, and a named vector of parameters. Custom selection functions should return a non-empty subset of the named numeric vector.
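As an illustration of that contract, here is a small custom selection function in base R. The rule itself (keep variables whose error is at least half of the largest error, capped at count) and the function name are invented for this example; only the two-argument shape and the named-subset return value come from the description above.

```r
# Hypothetical custom rule: keep variables whose error is at least half of
# the largest error, capped at params[["count"]] variables.
# Assumes errors are non-negative and count is at least 1, so the result
# is never empty.
select_top_half <- function(errors, params) {
  ranked <- sort(errors, decreasing = TRUE)
  keep <- ranked[ranked >= 0.5 * ranked[[1]]]
  head(keep, params[["count"]])
}

errors <- c(age = 0.12, gender = 0.02, region = 0.07, education = 0.01)
picked <- select_top_half(errors, c(pct = 0.05, count = 5))
names(picked)
```

Because select_function accepts a function closure, a function of this shape could be passed as select_function = select_top_half, with its parameters supplied through select_params.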
Built-in error functions include:
"linear" |
Sum of absolute differences.
"squared" |
Sum of squared differences.
"max" |
Maximum absolute difference.
"mean" |
Mean absolute difference.
"maxsquared" |
Maximum squared difference.
"meansquared" |
Mean squared difference.
"total" |
Same as "linear", backwards compatibility for anesrake.
"average" |
Same as "mean", backwards compatibility for anesrake.
"totalsquared" |
Same as "squared", backwards compatibility for anesrake.
"averagesquared" |
Same as "meansquared", backwards compatibility for anesrake.
Custom error functions should take two arguments: a numeric vector containing the target proportions, and a numeric vector containing the current weighted performance. They should return a single numeric summary of the data.
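The sketch below writes out two of the built-in summaries as plain functions and adds one custom error function with the same two-argument shape. The function names and the relative-deviation rule are assumptions made for illustration, not part of the package.

```r
# Two built-in style summaries, written out for illustration.
linear_error      <- function(target, current) sum(abs(target - current))
meansquared_error <- function(target, current) mean((target - current)^2)

# Hypothetical custom error: largest deviation relative to its target.
# Assumes every target proportion is greater than zero.
max_relative_error <- function(target, current) {
  max(abs(current - target) / target)
}

target  <- c(young = 0.4, old = 0.6)
current <- c(young = 0.5, old = 0.5)

linear_error(target, current)        # about 0.2
meansquared_error(target, current)   # about 0.01
max_relative_error(target, current)  # about 0.25
```

Since error_function accepts a function closure, a function like max_relative_error could be supplied as error_function = max_relative_error. Note that changing the error scale also changes the meaning of select_params[["pct"]].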
NA Values in Data
If data contains an NA in raking variables, harvest() will ignore those
observations when raking on the variables where they are NA. This
effectively means that when raking an age variable, respondents with
missing age are assumed to be correctly proportioned by age. In addition,
calculations of weighted marginals (for instance, for error) ignore NA
respondents.
An alternative strategy supported by harvest() is for the user to specify
add_na_proportion, an argument which will interpret missing data as a
"decline to state" response category, and also alter target proportions to
add such a category. The proportion missing in the data is assumed to be
the population "decline to state" proportion. Other population proportions
are adjusted accordingly, as if decline to state is distributed randomly
with respect to other such values. Documentation for this argument is
included above.
In cases where systematic nonresponse to a question is a problem, users might try external packages capable of imputing missing data, or else alter target proportions and data to remove missingness.
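The target adjustment described for add_na_proportion can be illustrated with a few lines of base R. This is a sketch of the arithmetic only: the data are made up, the label "declined" is hypothetical, and using the unweighted share of missing rows is an assumption.

```r
# Hypothetical raking variable with two missing responses out of ten.
age <- c("young", "old", NA, "young", "old", "old", NA, "young", "old", "old")
age_target <- c(young = 0.4, old = 0.6)

# Share of rows with missing age, treated as the population share of
# "decline to state".
na_share <- mean(is.na(age))

# Shrink the existing targets proportionally and append the new category.
adjusted <- c(age_target * (1 - na_share), declined = na_share)

adjusted
sum(adjusted)
```

The adjusted targets still sum to 1, and the ratio between the original categories is unchanged, which is what "distributed randomly with respect to other values" implies.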
target
Target proportions contain three pieces of information: a variable-level
pair and an associated proportion. target can be specified one of two ways:
as a list of named vectors, or as a data frame.
If a list of named vectors, the list names are variable names, the vector
names are variable levels, and the vector values are proportions. This is
the specification used in anesrake's anesrake function, and also the manner
in which the built-in ns_target dataset is specified.
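A target in the list-of-named-vectors form might look like the following. The variable names, levels, and proportions here are invented for illustration.

```r
# Hypothetical target: list names are variables, vector names are levels,
# vector values are population proportions.
example_target <- list(
  gender = c(f = 0.52, m = 0.48),
  age    = c(young = 0.4, old = 0.6)
)

names(example_target)        # variable names
names(example_target$age)    # levels of one variable

# Sanity check: the proportions for each variable should sum to 1.
level_sums <- vapply(example_target, sum, numeric(1))
level_sums
```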
If a data frame, harvest() attempts to match variables in the data frame to
the three pieces of information above. Matching occurs in the following
order:
1. If the user provides a target_map argument, then target_map should be a
named vector whose names are "variable", "level", and "proportion" and
whose values are the numeric indices or column names of the respective
data in the target argument.
2. If columns named "variable", "level", and "proportion" exist in the
target data frame, then these will be used.
3. If the target data frame is exactly three columns, then the first column
is assumed to contain "variable", the second to contain "level", and the
third to contain "proportion". A warning will be generated if the user
provided some but not all of target_map.
If none of these conditions holds, an error will be produced.
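The matching order above can be sketched in base R as follows. The helper resolve_columns, the example column names, and the conversion to list form are all hypothetical and written only to illustrate the documented order of checks.

```r
# Hypothetical target in data frame form with non-standard column names.
target_df <- data.frame(
  var   = c("gender", "gender", "age", "age"),
  lvl   = c("f", "m", "young", "old"),
  share = c(0.52, 0.48, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical resolver that mirrors the three matching rules in order.
resolve_columns <- function(target, target_map = NULL) {
  needed <- c("variable", "level", "proportion")
  if (!is.null(target_map)) {
    if (all(needed %in% names(target_map))) return(target_map[needed])
    warning("target_map is incomplete; trying the other matching rules.")
  }
  if (all(needed %in% names(target))) return(setNames(needed, needed))
  if (ncol(target) == 3) return(setNames(names(target), needed))
  stop("Could not match target columns to variable, level, and proportion.")
}

# No target_map and no standard names, so the three-column rule applies.
cols <- resolve_columns(target_df)

# Convert to the list-of-named-vectors form.
target_list <- lapply(
  split(target_df, target_df[[cols[["variable"]]]]),
  function(d) setNames(d[[cols[["proportion"]]]], d[[cols[["level"]]]])
)
target_list
```

An explicit map such as c(variable = "var", level = "lvl", proportion = "share") would be resolved by the first rule instead.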
If weights are attached to a data frame, the weights will be called
"weights" by default. If such a column already exists, the column will be
called ".weights.autumn". If this column already exists, ".weights_autumn1"
through ".weights_autumn10" will be used. If all of these columns exist,
harvest() will return an error. To customize the column name, use the
argument weight_column described above.
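The fallback order for the weight column name can be sketched as a small base R helper. The function pick_weight_column is hypothetical; it simply walks the candidate names listed above in order.

```r
# Hypothetical helper: return the first candidate name not already in use.
pick_weight_column <- function(existing) {
  candidates <- c("weights", ".weights.autumn",
                  paste0(".weights_autumn", 1:10))
  free <- setdiff(candidates, existing)   # keeps the candidate order
  if (length(free) == 0) stop("No free weight column name available.")
  free[[1]]
}

pick_weight_column(c("id", "age"))
pick_weight_column(c("id", "weights"))
pick_weight_column(c("weights", ".weights.autumn"))
```

The three calls fall back step by step: first "weights", then ".weights.autumn", then ".weights_autumn1".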
## Not run:
# Simple call
harvest(respondent_data, ns_target)

# Pipe workflow (the %>% operator comes from magrittr, also exported by dplyr)
respondent_data %>% harvest(ns_target)

# Return weights as vector instead of attaching to data frame
harvest(respondent_data, ns_target, attach_weights = FALSE)

# Modified convergence criteria to be more permissive
# (no trailing comma inside c(), which would be an error in R)
harvest(respondent_data, ns_target,
        convergence = c(
          pct = 0.05, absolute = 1e-4
        ))

# Limit runtime to 3 seconds:
harvest(respondent_data, ns_target,
        convergence = c(
          pct = 0.01, absolute = 1e-6,
          time = 3
        ))

# Alternate error function or variable selection function:
harvest(respondent_data, ns_target,
        error_function = "meansquared",
        select_function = "number")

# Generate an annoying amount of diagnostic information
harvest(respondent_data, ns_target, verbose = 2)
## End(Not run)