anesrake: Function to perform full ANES variable selection and...


Description

anesrake takes a list of variables and target values and determines which variables should be used for weighting, following the procedures outlined in DeBell and Krosnick (2009). It then rakes on the selected variables to produce weights that match the targets provided.

Usage

anesrake(inputter, dataframe, caseid, weightvec = NULL,
cap = 5, verbose = FALSE, maxit = 1000, type = "pctlim",
pctlim = 5, nlim = 5, filter = 1, choosemethod = "total",
iterate = TRUE, convcrit = 0.01, force1=TRUE, center.baseweights=TRUE)

Arguments

inputter

The inputter object should contain a list of all target values for the raking procedure. Each list element in inputter should be a vector corresponding to the weighting targets for a single variable. Hence, the vector enumerating the weighting targets for a variable with 2 levels should be of length 2, while a vector enumerating the weighting targets for a variable with 5 levels should be of length 5. List elements in inputter should be named according to the variable that they will match in the corresponding dataset. Hence, a list element enumerating the proportion of the sample that should be of each gender should be labeled "female" if the variable in dataframe is also titled "female."

inputter elements must be vectors of class numeric or factor, and each must match the class of the corresponding variable in dataframe. Logical variables in dataframe can be matched to a numeric vector of length 2, ordered with the TRUE target as the first element and the FALSE target as the second. Targets for factors must be named to match every level present in the dataframe (e.g. a variable with 2 age groups "under40" and "over40" should have elements named "under40" and "over40" respectively). anesrake attempts to coerce any unrecognized vector types to numeric. Targets can be entered either as proportions (ideal) or as the number of individuals in the population in each target category (N). List elements whose totals are greater than 1.5 are treated as Ns, while elements whose totals are less than 1.5 are treated as proportions.
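The rules above can be illustrated with a small base R sketch. The data frame, the variable names, and the helper looks_like_counts are hypothetical and exist only for this illustration; the helper mirrors the documented 1.5 threshold and is not taken from the package source.

```r
# Hypothetical data frame standing in for `dataframe`.
df <- data.frame(
  married = c(TRUE, FALSE, TRUE, TRUE),
  agegrp  = factor(c("under40", "over40", "over40", "under40"),
                   levels = c("under40", "over40"))
)

# Logical variable: TRUE target first, FALSE target second.
married_target <- c(0.4, 0.6)

# Factor variable: one named element per level present in the data.
age_target <- c(under40 = 0.45, over40 = 0.55)

# List names must match the column names in the data frame.
targets <- list(married = married_target, agegrp = age_target)

# Quick consistency checks before raking.
names_ok  <- all(names(targets) %in% names(df))
levels_ok <- setequal(names(targets$agegrp), levels(df$agegrp))

# Illustration of the documented rule: totals above 1.5 are read as Ns.
looks_like_counts <- function(target) sum(target) > 1.5
looks_like_counts(age_target)
looks_like_counts(c(under40 = 450, over40 = 550))
```

Running checks of this kind first makes naming mismatches easier to diagnose than an error raised partway through weighting.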

dataframe

The dataframe argument identifies a data.frame object containing the data to be weighted. The data.frame must contain all of the variables that will be used in the weighting process, and those variables must have the same names as the corresponding inputter list elements.

caseid

The caseid argument identifies a unique case identifier for each individual in the dataset. If filters are used, the resulting vector of weights will differ in length from the overall dataframe. caseid is included in the output so that weights can be matched back to the relevant cases. caseid must have a length equal to the number of cases in dataframe.
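As a rough sketch of why caseid matters, the following base R snippet attaches a shorter vector of weights back to a full dataset. The list named result only imitates the caseid and weightvec elements of an anesrake result; its values are made up for illustration.

```r
# Hypothetical full dataset of five cases.
full <- data.frame(caseid = 1:5, age = c(22, 35, 47, 58, 63))

# Stand-in for the caseid and weightvec elements of a result in which
# cases 2 and 4 were filtered out (values are invented).
result <- list(caseid = c(1L, 3L, 5L), weightvec = c(0.8, 1.1, 1.1))

# Look up each case's weight by identifier; filtered-out cases get NA.
full$weight <- result$weightvec[match(full$caseid, result$caseid)]
full
```

Matching on the identifier rather than on row position keeps the weights aligned even when the filtered subset is not a contiguous block of rows.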

weightvec

weightvec is an optional input for cases where base weights, a stratification correction, or some other sampling probability should be accounted for before weighting is conducted. If defined, weightvec must have a length equal to the number of cases in the dataframe. If undefined, weightvec will be automatically seeded with a vector of 1s.

cap

cap defines the maximum weight to be used. cap can be defined by the user with the command cap=x, where x is any value above 1 at which the algorithm will cap weights. If cap is set below 1, the function will return an error. If cap is set between 1 and 1.5, the function will return a warning that the low cap may substantially increase the amount of time required for weighting. In the absence of a user-defined cap, the algorithm defaults to a starting value of 5 in line with DeBell and Krosnick, 2009. For no cap, cap simply needs to be set to an arbitrarily high number. (Note: Capping using the cap command caps at each iteration.)
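A minimal sketch of the documented behavior follows. The function names check_cap and trim_weights are hypothetical, and the trimming step is a plain truncation chosen for illustration; the package's own capping routine, which is applied at each iteration, may rescale weights differently.

```r
# Hypothetical helper mirroring the documented checks on cap.
check_cap <- function(cap) {
  if (cap < 1) stop("cap must be at least 1")
  if (cap < 1.5) warning("a cap this low may slow weighting substantially")
  invisible(cap)
}

# Assumption: capping is shown here as simple truncation at the cap.
trim_weights <- function(w, cap = 5) {
  check_cap(cap)
  pmin(w, cap)
}

w <- c(0.4, 0.9, 1.3, 7.2)
trimmed   <- trim_weights(w)             # 7.2 is cut back to the default cap
untrimmed <- trim_weights(w, cap = 1e6)  # arbitrarily high cap: no trimming
```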

verbose

Users interested in seeing the progress of the algorithm can set verbose to equal TRUE. The algorithm will then inform the user of the progress of each raking and capping iteration.

maxit

Users can set a maximum number of iterations for the function should it fail to converge using maxit=X, where X is the maximum number of iterations. The default is set to 1000.

type

type identifies which method of variable selection should be used to choose weighting variables. Five options are available: type=c("nolim", "pctlim", "nlim", "nmin", "nmax"). If type="nolim", all variables specified in inputter will be included in the weighting procedure. If type="pctlim" (DEFAULT), the variable selection algorithm will assess which variables have distributions that deviate from their targets by more than the amount specified by pctlim, using the method given in choosemethod. If type="nlim", the variable selection algorithm will use the number of variables specified by nlim, choosing the most discrepant variables as identified by choosemethod. If type="nmin", the variable selection algorithm will use at least nlim variables, but will include more if additional variables are off by more than pctlim (all identified using choosemethod). If type="nmax", the variable selection algorithm will use no more than nlim variables, but will only use that many variables if at least that many are off by more than pctlim (all identified using choosemethod).
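The five selection rules can be sketched in base R as follows. The function select_vars and the discrepancy values are hypothetical; this is one reading of the rules described above, not the package's internal code.

```r
# disc: named vector with one discrepancy score per candidate variable.
select_vars <- function(disc, type = "pctlim", pctlim = 0.05, nlim = 5) {
  ranked <- sort(disc, decreasing = TRUE)    # most discrepant first
  over   <- names(ranked)[ranked > pctlim]   # variables past the limit
  top_n  <- names(head(ranked, nlim))        # the nlim most discrepant
  switch(type,
    nolim  = names(disc),          # use everything
    pctlim = over,                 # only variables past the limit
    nlim   = top_n,                # exactly the nlim most discrepant
    nmin   = union(top_n, over),   # at least nlim, plus any others past the limit
    nmax   = head(over, nlim),     # at most nlim, all past the limit
    stop("unknown type: ", type)
  )
}

# Made-up discrepancy scores for four candidate variables.
disc <- c(age = 0.15, married = 0.27, educ = 0.03, region = 0.08)

select_vars(disc, "pctlim")
select_vars(disc, "nlim", nlim = 2)
select_vars(disc, "nmin", nlim = 1)
select_vars(disc, "nmax", nlim = 2)
```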

pctlim

pctlim is the discrepancy limit for selection. Variable selection will only select variables that are discrepant by more than the amount specified. pctlim can be specified either in percentage points (5 is 5 percent) or as a decimal (.05 is 5 percent). The algorithm assumes that a decimal is being used if pctlim<1. Hence researchers interested in a discrepancy limit of half a percent would need to use pctlim=.005.
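The scaling rule is easy to get wrong, so here is a small sketch of it. The helper normalize_pctlim is hypothetical and only restates the rule described above: values below 1 are read as decimals, everything else as percentage points.

```r
# Hypothetical helper: convert a user-supplied pctlim to a proportion.
normalize_pctlim <- function(pctlim) {
  if (pctlim < 1) pctlim else pctlim / 100
}

normalize_pctlim(5)      # percentage points: 5 percent
normalize_pctlim(0.05)   # decimal: also 5 percent
normalize_pctlim(0.5)    # read as a decimal, so 50 percent, not half a percent
normalize_pctlim(0.005)  # half a percent
```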

nlim

nlim is the number of variables to be chosen via the variable selection method chosen in choosemethod.

filter

filter is a vector of 1 for cases to be included in weighting and 0 for cases that should not be included. The filter vector must have the same number of cases as the dataframe. In the absence of a user-defined filter, the algorithm defaults to a starting value of 1 (inclusion) for all individuals.
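One way to build such a vector in base R is sketched below. The data frame and the inclusion rule (adults with a known age) are hypothetical.

```r
# Hypothetical data with one minor and one missing age.
df <- data.frame(age = c(17, 25, 40, NA, 62))

# 1 = include in weighting, 0 = exclude; missing ages are excluded.
keep <- as.numeric(!is.na(df$age) & df$age >= 18)
keep

# The filter must have one entry per row of the data frame.
length(keep) == nrow(df)
```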

choosemethod

choosemethod is the method for choosing most discrepant variables. Six options are available: choosemethod=c("total", "max", "average", "totalsquared", "maxsquared", "averagesquared"). If choosemethod="total", variable choice is determined by the sum of the differences between actual and target values for each prospective weighting variable. If choosemethod="max", variable choice is determined by the largest individual difference between actual and target values for each prospective weighting variable. If choosemethod="average", variable choice is determined by the mean of the differences between actual and target values for each prospective weighting variable. If choosemethod="totalsquared", variable choice is determined by the sum of the squared differences between actual and target values for each prospective weighting variable. If choosemethod="maxsquared", variable choice is determined by the largest squared difference between actual and target values for each prospective weighting variable (note that this is identical to choosemethod="max" if the selection type is nlim). If choosemethod="averagesquared", variable choice is determined by the mean of the squared differences between actual and target values for each prospective weighting variable.
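The six measures can be sketched as a single base R function. The name discrepancy is hypothetical, and taking absolute differences is an assumption, made because signed differences between two sets of proportions would cancel to zero.

```r
# actual, target: proportions for each category of one variable.
discrepancy <- function(actual, target, method = "total") {
  d <- abs(actual - target)  # assumption: absolute differences
  switch(method,
    total          = sum(d),
    max            = max(d),
    average        = mean(d),
    totalsquared   = sum(d^2),
    maxsquared     = max(d^2),
    averagesquared = mean(d^2),
    stop("unknown choosemethod: ", method)
  )
}

# Made-up two-category variable: observed shares versus targets.
actual <- c(0.535, 0.465)
target <- c(0.4, 0.6)

discrepancy(actual, target, "total")
discrepancy(actual, target, "max")
discrepancy(actual, target, "averagesquared")
```

Because squaring preserves the ordering of non-negative values, "max" and "maxsquared" rank variables identically, which is why the two coincide when a fixed number of variables is selected.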

iterate

iterate is a logical value controlling how raking proceeds when type is "pctlim", "nmin", or "nmax". If iterate=TRUE, anesrake will check whether any variables that were not used in raking deviate from their targets by more than pctlim after raking. When this is the case, raking will be rerun using the raked weights as seeds (weightvec), with the additional variables that meet this criterion included as well. For type="nmax", this will only occur if nlim has not been reached.

convcrit

convcrit is the criterion for convergence. The raking algorithm is determined to have converged when the most recent iteration represents less than a convcrit percentage improvement over the prior iteration.
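For readers unfamiliar with raking, here is a generic sketch of the technique (iterative proportional fitting) in base R. It is not the package's implementation: the functions rake_once and rake_sketch are hypothetical, capping is omitted, and the stopping rule is simplified to the total absolute change in weights falling below a tolerance tol, rather than the percentage-improvement rule governed by convcrit.

```r
# Adjust weights so one variable's weighted shares match its targets.
rake_once <- function(w, x, target) {
  x      <- factor(x, levels = names(target))
  share  <- as.vector(tapply(w, x, sum)) / sum(w)  # current weighted shares
  adjust <- unname(target) / share                 # per-level adjustment
  w * adjust[as.integer(x)]
}

# Cycle over all variables until the weights stop changing.
rake_sketch <- function(df, targets, maxit = 1000, tol = 1e-8) {
  w <- rep(1, nrow(df))
  for (i in seq_len(maxit)) {
    w_old <- w
    for (v in names(targets)) w <- rake_once(w, df[[v]], targets[[v]])
    if (sum(abs(w - w_old)) < tol) {
      return(list(weights = w, iterations = i, converged = TRUE))
    }
  }
  list(weights = w, iterations = maxit, converged = FALSE)
}

# Hypothetical data and targets.
df <- data.frame(
  sex = c("f", "f", "f", "m", "m", "f", "m", "f"),
  age = c("young", "old", "old", "young", "old", "young", "old", "old")
)
targets <- list(sex = c(f = 0.5, m = 0.5), age = c(young = 0.4, old = 0.6))

fit <- rake_sketch(df, targets)
tapply(fit$weights, df$sex, sum) / sum(fit$weights)
tapply(fit$weights, df$age, sum) / sum(fit$weights)
```

Each pass fixes one margin exactly and disturbs the others slightly; repeating the passes shrinks those disturbances until every margin matches its target to within the tolerance.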

force1

force1 ensures that the target categories of each raking variable sum to 1. To do so, each target in inputter is divided by the sum of the targets for that variable.
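The normalization described here amounts to a single division, sketched below with a hypothetical helper force_to_one and made-up targets that fall slightly short of 1 because of rounding.

```r
# Hypothetical helper: rescale one variable's targets to sum to 1.
force_to_one <- function(target) target / sum(target)

# Rounded targets that sum to 0.99 rather than 1.
age_target <- c(under40 = 0.44, over40 = 0.55)

fixed <- force_to_one(age_target)
fixed
sum(fixed)
```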

center.baseweights

center.baseweights rescales the initial base weights to have a mean of 1 if TRUE (the default setting).
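Centering in this sense can be sketched as dividing by the mean. The base weights below are made up, and this shows only the arithmetic, not the package's internal code.

```r
# Hypothetical base weights with a mean of 2.
base_w <- c(0.5, 1.5, 2.0, 4.0)

# Rescale so the weights average to 1 while keeping their ratios.
centered <- base_w / mean(base_w)
centered
mean(centered)
```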

Value

The list returned by anesrake has the following elements:

weightvec

Vector of weights from the raking algorithm

type

Type of variable selection used (identical to specified type)

caseid

Case IDs for final weights – helpful for matching weightvec to cases if a filter is used

varsused

List of variables selected for weighting

choosemethod

Method for choosing variables for weighting (identical to specified choosemethod)

converge

Notes whether full convergence was achieved, algorithm failed to converge because convergence was not possible, or maximum iterations were reached

nonconvergence

Measure of remaining discrepancy from benchmarks if convergence was not achieved

targets

inputter from above, a list of the targets used for weighting

dataframe

Copy of the original dataframe used for weighting (filter variable applied if specified)

iterations

Number of iterations required for convergence (or non-convergence) of final model

iterate

Copy of iterate from above

Author(s)

Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).

References

DeBell, M. and J.A. Krosnick. (2009). Computing Weights for American National Election Study Survey Data, ANES Technical Report Series, No. nes012427. Available from: ftp://ftp.electionstudies.org/ftp/nes/bibliography/documents/nes012427.pdf

Examples

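library(anesrake)  # provides anesrake() and the anes04 example data

data("anes04")

# Unique identifier for each case
anes04$caseid <- seq_len(nrow(anes04))

# Collapse age into six groups and label the levels
anes04$agecats <- cut(anes04$age, c(0, 25, 35, 45, 55, 65, 99))
levels(anes04$agecats) <- c("age1824", "age2534", "age3544",
          "age4554", "age5564", "age6599")

# Targets for married: TRUE first, then FALSE
marriedtarget <- c(.4, .6)

# Targets for agecats, named to match its levels
agetarg <- c(.10, .15, .17, .23, .22, .13)
names(agetarg) <- c("age1824", "age2534", "age3544",
          "age4554", "age5564", "age6599")

# List names must match the variable names in anes04
targets <- list(marriedtarget, agetarg)
names(targets) <- c("married", "agecats")

# Select variables and rake, reporting progress at each iteration
outsave <- anesrake(targets, anes04, caseid = anes04$caseid,
          verbose = TRUE)

# Pair each case ID with its final weight
caseweights <- data.frame(cases = outsave$caseid, weights = outsave$weightvec)

summary(caseweights)

summary(outsave)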

Example output

[1] "Raking...Iteration 1"
[1] "Current iteration changed total weights by 303.931933308375"
[1] "Raking...Iteration 2"
[1] "Current iteration changed total weights by 50.8760710934304"
[1] "Raking...Iteration 3"
[1] "Current iteration changed total weights by 2.45181318049929"
[1] "Raking...Iteration 4"
[1] "Current iteration changed total weights by 0.115529921587497"
[1] "Raking...Iteration 5"
[1] "Current iteration changed total weights by 0.00543798477855262"
[1] "Raking...Iteration 6"
[1] "Current iteration changed total weights by 0.000255952672060356"
[1] "Raking...Iteration 7"
[1] "Current iteration changed total weights by 1.20470391882233e-05"
[1] "Raking...Iteration 8"
[1] "Current iteration changed total weights by 5.67023318742699e-07"
[1] "Raking...Iteration 9"
[1] "Current iteration changed total weights by 2.66883061206258e-08"
[1] "Raking...Iteration 10"
[1] "Current iteration changed total weights by 1.25607590995003e-09"
[1] "Raking...Iteration 11"
[1] "Current iteration changed total weights by 5.92315085867767e-11"
[1] "Raking...Iteration 12"
[1] "Current iteration changed total weights by 2.76700884427328e-12"
[1] "Raking...Iteration 13"
[1] "Current iteration changed total weights by 2.07389660999979e-13"
[1] "Raking...Iteration 14"
[1] "Current iteration changed total weights by 1.23678844943242e-13"
[1] "Raking...Iteration 15"
[1] "Current iteration changed total weights by 1.2356782264078e-13"
[1] "Raking converged in 15 iterations"
     cases           weights      
 Min.   :   1.0   Min.   :0.4665  
 1st Qu.: 303.8   1st Qu.:0.6956  
 Median : 606.5   Median :0.8864  
 Mean   : 606.5   Mean   :1.0000  
 3rd Qu.: 909.2   3rd Qu.:1.1874  
 Max.   :1212.0   Max.   :1.7870  
$convergence
[1] "Complete convergence was achieved after 15 iterations"

$base.weights
[1] "No Base Weights Were Used"

$raking.variables
[1] "married" "agecats"

$weight.summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4665  0.6956  0.8864  1.0000  1.1874  1.7870 

$selection.method
[1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."

$general.design.effect
[1] 1.128633

$married
      Target Unweighted N Unweighted %     Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
FALSE    0.6          563     0.464905  726.6693   0.6   0.1350950 0.000000e+00   0.1350950
TRUE     0.4          648     0.535095  484.4462   0.4  -0.1350950 5.551115e-17  -0.1350950
Total    1.0         1211     1.000000 1211.1155   1.0   0.2701899 5.551115e-17   0.2701899

$agecats
        Target Unweighted N Unweighted %   Wtd N Wtd %  Change in % Resid. Disc.  Orig. Disc.
age1824   0.10          150    0.1237624  121.20  0.10 -0.023762376 0.000000e+00 -0.023762376
age2534   0.15          205    0.1691419  181.80  0.15 -0.019141914 0.000000e+00 -0.019141914
age3544   0.17          217    0.1790429  206.04  0.17 -0.009042904 0.000000e+00 -0.009042904
age4554   0.23          237    0.1955446  278.76  0.23  0.034455446 2.775558e-17  0.034455446
age5564   0.22          216    0.1782178  266.64  0.22  0.041782178 5.551115e-17  0.041782178
age6599   0.13          187    0.1542904  157.56  0.13 -0.024290429 0.000000e+00 -0.024290429
Total     1.00         1212    1.0000000 1212.00  1.00  0.152475248 8.326673e-17  0.152475248

anesrake documentation built on May 2, 2019, 1:42 p.m.
