chaid_raking: CHAID Rake sample to match population (Population input is a...
In neale-eldash/pd: Day-2-day functions in R


This function rakes the sample to match the population counts, using cells derived from CHAID trees. The main idea is to fit a CHAID tree on the survey data with a pre-defined dependent variable (such as voting intention), then use the resulting leaves of the tree as the raking cells. The algorithm has 5 basic steps:

Check variables: Checks that the same variables, with the same labels, are present in both dataframes.
CHAID Tree: Run the tree on the survey data, and create the resulting cells in both the survey and population dataframes.
Population targets: Calculates the population targets from the population dataframe.
Rake sample: Uses an adjusted raking algorithm adapted from rake.
Check weights: Compares the weights to the population targets to make sure the raking worked.
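
Steps 3 to 5 can be illustrated with a minimal base-R sketch. This is not the package's implementation: the toy dataframes, the `cell` column (standing in for the tree leaf assigned in step 2) and the one-pass weighting are assumptions made for illustration. With a single set of cells and no strata, weighting reduces to dividing each cell's population target by its sample count.

```r
# Toy data (hypothetical): 'cell' stands in for the leaf assigned by the tree.
pop <- data.frame(
  cell = c("A", "A", "B", "B", "B", "C"),
  wgt  = c(100, 150, 200, 50, 50, 450),
  stringsAsFactors = FALSE
)
svy <- data.frame(
  id   = 1:5,
  cell = c("A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Step 3 (population targets): sum the population weights within each cell.
targets <- tapply(pop$wgt, pop$cell, sum)

# Step 4 (simplified): weight = cell target / number of sample units in the cell.
n.svy <- table(svy$cell)
svy$weight <- as.numeric(targets[svy$cell]) / as.numeric(n.svy[svy$cell])

# Step 5 (check): weighted sample totals per cell should reproduce the targets.
check <- tapply(svy$weight, svy$cell, sum)
print(cbind(target = targets, weighted = check))
```

The real function rakes iteratively and can do so within strata; the sketch only shows why the weighted sample totals are compared against the population targets at the end.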

chaid_raking(
  df.pop,
  df.svy,
  strata = NULL,
  id.var = NULL,
  dep = NULL,
  wgt.pop = NULL,
  minbucket = 30,
  cp = 0.001
)

`df.pop`	The population dataframe, containing the variables to be used in the analysis (weights, raking variable targets and strata variable). Both the raking and strata variables must exist in both the survey and population dataframes. The algorithm checks that these variables exist, but does not check that they are coded consistently in both datasets.
`df.svy`	The sample dataframe, containing the variables to be used in the analysis (unique id, raking variable targets, strata variable and the dependent variable used to build the tree). Both the raking and strata variables must exist in both the survey and population dataframes. The algorithm checks that these variables exist, but does not check that they are coded consistently in both datasets.
`strata`	A string with the name of the stratifying variable. If this variable is defined, raking will be performed within each stratum. This variable should exist in both the sample and population dataframes.
`id.var`	A string with the name of the unique id variable. This variable needs to exist only in the survey dataframe.
`dep`	A string with the name of the dependent variable to be used in the CHAID analysis. This variable needs to exist only in the survey dataframe.
`wgt.pop`	A string with the name of the weight variable. This variable will be used to calculate the population targets. If there is no weight variable in the population dataframe, create a constant variable. This variable needs to exist only in the population dataframe.
`minbucket`	[Optional] An integer representing the minimum number of sample units in each leaf of the CHAID Tree. Default value is 30.
`cp`	[Optional] A real number representing the complexity of the CHAID Tree. Default value is 0.001.
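
Because the function does not verify that shared variables are coded the same way, and because `wgt.pop` may require a constant column, a small preparation step before calling it can help. The sketch below uses base R only; the toy dataframes and the column names `sex`, `region` and `wgt.const` are hypothetical and not part of the package.

```r
# Hypothetical toy dataframes; column names are placeholders.
df.pop <- data.frame(
  sex    = c("M", "F", "F"),
  region = c("N", "S", "S"),
  stringsAsFactors = FALSE
)
df.svy <- data.frame(
  sex    = c("M", "F"),
  region = c("N", "South"),   # coded differently from df.pop on purpose
  stringsAsFactors = FALSE
)

# No weight column in the population data: add a constant one for wgt.pop.
df.pop$wgt.const <- 1

# List shared variables whose survey values do not all appear in the population.
shared <- intersect(names(df.pop), names(df.svy))
mismatch <- Filter(
  function(v) length(setdiff(unique(df.svy[[v]]), unique(df.pop[[v]]))) > 0,
  shared
)
print(mismatch)
```

Any variable reported in `mismatch` would need recoding before raking, since its survey categories would have no matching population target.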

A list with five components:

df.svy (dataframe): the original sample dataframe with the weights.
cells.pop (dataframe): the population target cells created by the CHAID Tree.
check (dataframe): comparison of all weights and population totals.
trees (dataframe): the output from all trees (per stratum).
grps (dataframe): description of all cells created by the algorithm.

## Load data
# Survey data
data(svy.vote)
# Population data
data(cps)

## Raking WITHOUT strata variable:
rake.chaid <- chaid_raking(
  cps,
  svy.vote,
  id.var = 'RESPID',
  wgt.pop = 'PWSSWGT',
  dep = 'lead',
  minbucket = 40,
  cp = 0.000001
)

## Raking WITH strata variable:
rake.chaid.strata <- chaid_raking(
  cps,
  svy.vote,
  strata = 'STATE',
  id.var = 'RESPID',
  wgt.pop = 'PWSSWGT',
  dep = 'lead',
  minbucket = 40,
  cp = 0.000001
)

## Save all trees (CHAID raking with strata), one plot per stratum
file <- "C://tree_raking.pdf"
pdf(file, paper = 'a4r', width = 12)
# prp() is the tree-plotting function from the rpart.plot package
purrr::walk(
  rake.chaid.strata$trees$fit,
  ~ rpart.plot::prp(.$tree, faclen = 0, cex = 0.8, extra = 1, main = .$cells.svy$strata[[1]])
)
dev.off()