etc/GM/Weighting_with_Regions-old.R

#' ---
#' title: "Poststratification Weighting using Gender and Regions"
#' author: "Georges Monette"
#' date: "August 1, 2016"
#' output: 
#'      html_document:
#'              toc: true
#'              toc_depth: 4
#' ---
#' 
#' _Note: I tried to use 'ave' in the base package instead of 'capply' to avoid using the spida2
#' package. I found bugs with 'ave' that don't occur with 'capply', so I am using 'spida2'._
#' 
#' This example illustrates how weights for strata formed by regions (e.g. States) 
#' can be combined with with weights created for 'within-region' strata (e.g. Gender)
#' to form 'overall' weights that can be used both to estimate the overall mean of a
#' response as well as state-by-state means and means of groups of states or, in general,
#' any group of strata.
#' 
#' Thus it is not necessary to work with different set of weights to estimate
#' state-by-state means and to estimate the overall mean.
#' 
#' From the start, we need to recognize a distinction between a 'weighted mean'
#' and a 'linear combination'. For a weighted mean, it doesn't matter how the weights
#' are 'normed', i.e. whether they sum to 1 or are all multiplied by the same constant
#' to make them sum to some other value such as the sample size, $n$, or the population
#' size $N$. The last choice produces 'Horvitz-Thompson' weights.
#' 
#' The formula to estimate a population weighted mean, is 
#' $$\frac{\sum_{k,i} w_{k,i} y_{k,i}}{\sum_{k,i} w_{k,i}}$$
#' where $w_{k,i}$ and $y_{k,i}$ are the weights and responses respectively 
#' for the observation $i$ in the stratum $k$. We assume that $w_{k,i}$ is constant 
#' within each
#' stratum and we can therefore simply write $w_k$.
#' 
#' The formula above yields the same value if all the weights are multiplied by the
#' same constant. 
#' 
#' To estimate the weighted population mean of a subgroup of strata, one merely uses the
#' same weights in the corresponding subgroup of the sample. One set of weights
#' serves for all weighted means. 
#' 
#' Later, we will also discuss the estimation of linear combinations, i.e. estimates of
#' the form
#' 
#' $$\sum_{k,i} c_{k,i} y_{k,i}$$
#' 
#' where $c_{k,i}$ combine post-stratification weights with coefficients intended to
#' estimate population parameters that cannot be estimated as weighted means, for example:
#' unweighted means of state means, gender gaps, and between state comparisons. 
#' 
#' ## Three common norms
#' 
#' As mentionned earlier, a weighted mean can use weights that are 
#' normed to sum to any positive value. Three following three choices are common.
#' 
#' ### Sum to $N$ (Horvitz-Thompson)
#' 
#' The weight in stratum $k$ is $w_k = N_k/n_k$. 
#' The linear combination $\sum\limits_{k,i} {{w_k}{y_{k,i}}}$ estimates 
#' the total of $y_{k,i}$ in the population.
#' 
#' ### Sum to $n$
#' 
#' These are the weights used in the Veracio example.  Since they sum to $n$, the average weight per sampled unit is 1. This facilitates identifying unusually
#' large weights that can be truncated to a maximum.  Note that these weights can be obtained by taking the ratio of the proportion that a sample is of the population 
#' divided by its proportion in the sample, i.e.
#' $$ w_k = \frac{N_k/N}{n_k/n}$$
#' 
#' When used to form an estimate with a linear combination, these weights produce an estimate
#' of the sum of $y_{k,i}$ in a sample of size $n$ whose strata are proportional in size to those 
#' of the population. This does not seem to be a quantity of special interest.
#' 
#' ### Sum to 1
#' 
#' When weights are normed to sum to 1, they can be used as coefficients 
#' of the linear combination, $\sum_{k,i} {w_k} y_{k,i}$ to directly yield an estimate
#' of the population mean since the denominator 
#' in the expression for the weighted sum, $\sum_{k,i} {w_k}$, is equal to one.  
#' 
#' ## Overall weights
#' 
#' Weights can be generated for the 'crossed' strata generated by State and Gender, e.g.
#' $w_k = N_k/n_k$ for each State by Gender combination.  
#' These weights can be used to estimate
#' the overall weighted average. 
#' 
#' They can also be used to estimate weighted averages within
#' each stratum and within any combination of strata simply by setting weights for strata
#' that are not included to 0.
#' 
#' ## Overall weights from within-region weights
#' 
#' The method used to transform within-region weights to overall weights depends on
#' the way the within-regions weights are normed.  
#' 
#' Let $N_{sg}$ and $n_{sg}$ be the size of the population and sample, respectively,
#' in 'State' $s$ and 'Gender' $g$. Let $N_g$ and $n_g$ be the corresponding sizes for
#' 'State' $s$. Note that $N_s = \sum_g N_{sg}$ and $n_s = \sum_g n_{sg}$.
#' 
#' ### Sum to $N_s$
#' 
#' The Horvitz-Thompson weights are already overall weights with no further transformation.
#' 
#' ### Sum to $n_s$
#' 
#' Within-state 'sum to $n_s$' weights have the form $\frac{N_{sg}/N_s}{n_{sg}/n_s}$, so multiplying
#' each by $\frac{N_s/N}{n_s/n}$ will yield overall 'sum to $n$' weights:
#' $$\frac{N_{sg}/N_s}{n_{sg}/n_s} \times \frac{N_s/N}{n_s/n} = \frac{N_{sg}/N}{n_{sg}/n}$$
#' 
#' Note that the factor $\frac{N_s/N}{n_s/n}$ is the vector of 'sum to $n$' weights for the 
#' between-state strata, i.e. the 'sum to $n$' weights if 'State' were the only stratification
#' variable.
#' 
#' ### Sum to 1
#' 
#' The within-state 'sum to 1' weights have the form $\frac{N_{sg}/N_s}{n_{sg}}$ so multiplying 
#' each by $N_s/N$ yields the overall 'sum to 1' weights:
#' $$\frac{N_{sg}/N_s}{n_{sg}} \times N_s/N = \frac{N_{sg}/N_s}{n_{sg}}$$
#' 
#' ## Functions for weighted means
#' 
#' The following functions provide weighted means and jacknife estimates of
#' standard errors of weighted means.  Functions for linear combinations are discussed
#' later. I think that a better implementation would treat both types of estimates in
#' a way that is more unified. 
#' 
#' For the weighted mean of x:
#' 
library(spida2)  # devtools::install_github('gmonette/spida2')
library(MASS)   # comes with default R installation
library(latticeExtra)  # from CRAN, also loads package:lattice

wtd_mean <- function(x, w) sum(x*w) / sum(w)    
#'
#' For jacknife estimates of the weighted mean, we use the following function. Note that
#' weights are adjusted to take the change in $n$ into account when obtaining the
#' jacknife estimate dropping an observation within a stratum.
#'
jk_wtd_means <- function(x, by, w, check = TRUE, FUN = wtd_mean){
        # Value: vector of 'drop-one' estimates for jacknife estimator of SE
        #
        # x: response variable, a numeric vector
        # by: vector or list of vectors defining strata
        # w: numeric vector of sampling weights constant within each stratum
        # Check that weights are constant within levels of 'by'
        library(spida2)  # devtools::install_github('gmonette/spida2')
        if(check) {
                constant <- function(z) length(unique(z)) <= 1 
                constant_by <- function(z, by) all(sapply(split(z, by), constant))
                if(!constant_by(w, by)) stop("w must constant within strata")
        }
        ns <- capply(x, by, length)
        force(w)
        by <- interaction(by)
        wtd_mean_drop_i <- function(i) {
                ns_drop <- capply(x[-i], by[-i], length)
                w_drop <- w[-i] * ns[-i] / ns_drop
                FUN(x[-i], w_drop)
        }
        sapply(seq_along(x), wtd_mean_drop_i)
}
#'
#' For the jacknife estimate of the standard error of the weighted mean:
#'  
jk_wtd_mean_se <- function(x, by = rep(1, length(x)), w = rep(1, length(x)), FUN = wtd_mean) {
        # Value: jacknife estimator of the standard error of a weighted mean
        # using post-stratification
        #
        # x: response variable, a numeric vector
        # by: vector or list of vectors defining strata
        # w: numeric vector of sampling weights constant within each stratum
        #
        library(spida2)
        theta_hat <- FUN(x, w)
        theta_hat_i <- jk_wtd_means(x, by, w, FUN = FUN)
        ns <- capply(x, by, length)
        sqrt(sum((theta_hat_i - theta_hat)^2 * (ns-1) / ns))
}
#' 
#' ## Example 
#'
#' This is a survey in 3 states along the eastern edge of the Rockies.
#' 
ds <- read.table(header = TRUE, text = "
Gender   State    Income
Male     Utah        10
Male     Utah        11
Male     Utah        12
Male     Utah        13
Male     Utah        14
Male     Utah        15
Female   Utah        16
Female   Utah        17
Female   Utah        18
Female   Utah        19
Male     Idaho       20
Male     Idaho       21
Male     Idaho       22
Female   Idaho       23
Female   Idaho       24
Female   Idaho       25
Female   Idaho       26
Female   Idaho       27
Female   Idaho       28
Female   Idaho       29
Male     Arizona     30
Male     Arizona     31
Male     Arizona     32
Male     Arizona     33
Male     Arizona     34
Male     Arizona     35
Male     Arizona     36
Female   Arizona     37
Female   Arizona     38
Female   Arizona     39
")
#'
#' and the population within each stratum (imaginary data):
#' 
dpop <- read.table(header = TRUE, text = "
Gender   State          N
Male     Utah     1500000
Female   Utah     1500000
Male     Idaho     700000
Female   Idaho     900000
Male     Arizona  4000000
Female   Arizona  3000000
")
#' 
#' First we create the 'overall' $n$ and $N$ variables.
#' 
ds$n <- with(ds, capply(State, list(State, Gender), length))  
#' 
#' Note that first argument of 'capply', 'State', could be any variable in the dataset
#' since it is used only for the length of the chunks within each 'State' by 'Gender' stratum.
#' 
ds <- merge(ds, dpop)
head(ds)
#'
#' ### Overall Horvitz-Thompson weights
#' 
ds$w_o_ht <- with(ds, N/n)
head(ds)
sum(ds$w_o_ht)

with(ds, wtd_mean(Income, w_o_ht))
with(ds, jk_wtd_mean_se(Income, list(Gender,State), w_o_ht))
with(ds, jk_wtd_mean_se(Income))
sd(ds$Income)/sqrt(length(ds$Income))
#'
#' ### Weighted mean within Utah
#'
with(subset(ds, State == "Utah"), wtd_mean(Income, w_o_ht))
with(subset(ds, State == "Utah"), jk_wtd_mean_se(Income, Gender, w_o_ht))
#' alternatively:
with(ds, wtd_mean(Income, w_o_ht * (State == "Utah")))
with(ds, jk_wtd_mean_se(Income, list(Gender, State), w_o_ht * (State == "Utah")))
#'
#' ### Weighted mean within Utah and Idaho
#'
with(subset(ds, State %in% c("Utah","Idaho")), wtd_mean(Income, w_o_ht))
with(subset(ds, State %in% c("Utah","Idaho")), jk_wtd_mean_se(Income, list(Gender, State), w_o_ht))
#' alternatively:
with(ds, wtd_mean(Income, w_o_ht * (State %in% c("Utah","Idaho"))))
with(ds, jk_wtd_mean_se(Income, list(Gender, State), w_o_ht * (State %in% c("Utah","Idaho"))))
#'
#' ### Weighted female income
#'
with(subset(ds, Gender == "Female"), wtd_mean(Income, w_o_ht))
with(subset(ds, Gender == "Female"), jk_wtd_mean_se(Income, list(Gender, State), w_o_ht))
#' alternatively:
with(ds, wtd_mean(Income, w_o_ht * (Gender == "Female")))
with(ds, jk_wtd_mean_se(Income, list(Gender, State), w_o_ht * (Gender == "Female")))
#'
#' We would get the same results with weights normed differently.
#' 
#' ### 'Sum to $n$' weights
#' 
ds$w_o_sn <- with(ds, (N/n)/mean(N/n))
sum(ds$w_o_sn)

with(ds, wtd_mean(Income, w_o_sn))
with(ds, jk_wtd_mean_se(Income, list(Gender,State), w_o_sn))
#' 
#' ### 'Sum to 1' weights:
#' 
ds$w_o_s1 <- with(ds, (N/n)/sum(N/n))
sum(ds$w_o_s1)

with(ds, wtd_mean(Income, w_o_s1))
with(ds, jk_wtd_mean_se(Income, list(Gender,State), w_o_s1))
#'
#' ## Linear combinations
#' 
#' ### Functions for linear combinations
#' 
std <- function(x, div = sum(x)) x/div
#' 
lin_comb <- function(x, w) sum(x*w)    
#'
#' Note the relationship between 'wtd_mean' and 'lin_comb':
#' 
with(ds, wtd_mean( Income, w_o_ht))
with(ds, lin_comb( Income, std(w_o_ht)))
#'
#' For jacknife estimates of a linear combination, we use the following function.
#'
jk_lin_comb_se <- function(x, by, w, check = TRUE) jk_wtd_mean_se(x, by, w, FUN = lin_comb)

with(ds, jk_wtd_mean_se(Income, list(Gender, State), w_o_ht))
with(ds, jk_lin_comb_se(Income, list(Gender, State), std(w_o_ht)))

#' 
#' #### Gender gap in mean salary in Utah
#' 
#' With linear combinations, we can estimate population parameters that can't be
#' estimated as weighted means. For example, the Gender gap in Utah:
#' 
coeffs <- with(ds, 
               std((State == "Utah") * (Gender == "Male") * w_o_ht)
               - std((State == "Utah") * (Gender == "Female") * w_o_ht) )
names(coeffs) <- with(ds, interaction(State,Gender))
cbind(coeffs)
with(ds, lin_comb(Income, coeffs))
with(ds, jk_lin_comb_se(Income, list(State, Gender), coeffs))
#'
#' #### Comparing two states
#' 
#' To compare Utah with Idaho using the population proportions of Gender:
#' 
coeffs <- with(ds, 
               std((State == "Utah") * w_o_ht)
               - std((State == "Idaho") *  w_o_ht) )
coeffs
names(coeffs) <- with(ds, interaction(State,Gender))
cbind(coeffs)
with(ds, lin_comb(Income, coeffs))
with(ds, jk_lin_comb_se(Income, list(State, Gender), coeffs))
#'
#' #### Unweighted average of two states
#' 
#' The unweighted average of Utah and Idaho using the Gender proportion in each state:
#' 
coeffs <- with(ds, 
               .5 * std((State == "Utah") * w_o_ht)
               + .5 * std((State == "Idaho") *  w_o_ht) )
coeffs
with(ds, lin_comb(Income, coeffs))
with(ds, jk_lin_comb_se(Income, list(State, Gender), coeffs))        
#'
#' #### The Gender gap within each state
#' 
states <- c("Idaho", "Utah", "Arizona")
names(states) <- states
coefs_gap <- lapply( states,
                     function(state)
                             with(ds, std((State == state) * (Gender == "Male") * w_o_ht) -
                                          std((State == state) * (Gender == "Female") * w_o_ht) ))
mat <- do.call(cbind,coefs_gap)
rownames(mat) <- with(ds, interaction(State,Gender))
MASS::fractions(mat)
with(ds, lapply(coefs_gap, function(w) lin_comb(Income,w)))
with(ds, lapply(coefs_gap, function(w) jk_lin_comb_se(Income,list(Gender, State), w)))
#' 
#' ## Hierarchical weights
#' 
#' The following illustrates how to create within-state weights and then how to modify them to form overall 
#' weights.
#' 
#' We can start with the convenient fact that Horvitz-Thompson weights 
#' serve as both within-state and
#' overall weights.  
#' To get within-state 'sum to $n$' weights, we need to norm the 
#' Horvitz-Thompson weights
#' within each state. 
#' We need to divide the Horvitz-Thompson weights by their within-state means:
#' 
ds$ht_ws_mean <- with(ds, capply(w_o_ht, State, mean))
ds$w_ws_sn <- with(ds, w_o_ht / ht_ws_mean)   # within-state 'sum to n' mean

with(ds, tapply(w_ws_sn, State, sum)) # check that they sum to within-state n
tab(ds, ~ State)
#'
#' As noted earlier, these weights need to be multiplied by $\frac{N_s/N}{n_s/n}$ to transform them into
#' overall 'sum to $n$' weights. The easiest way to generate $N_s$ is to use the Horvitz-Thomson weights:
#' 
ds$N_s <- with(ds, capply(w_o_ht, State, sum))
ds$n_s <- with(ds, capply(w_o_ht, State, length))
N_total <- sum(ds$w_o_ht)
n_total <- nrow(ds)

ds$w_o_sn2 <- with(ds, w_ws_sn * (N_s / N_total) / (n_s/n_total))
ds
#' Comparing these weighs with previous ones:
with(ds, max(abs(w_o_sn2 - w_o_sn)))
#'
#' Estimating means per stratum
#'
ds$mean_sg <- capply(ds, ds[c('State','Gender')], with, wtd_mean(Income, w_o_ht))
ds$mean_sg_se <- capply(ds, ds[c('State','Gender')], with, jk_wtd_mean_se(Income, w = w_o_ht))
ds$mean_s <- capply(ds, ds$State, with, wtd_mean(Income, w_o_ht))
ds$mean_s_se <- capply(ds, ds$State, with, jk_wtd_mean_se(Income, Gender, w_o_ht))
ds$mean_g <- capply(ds, ds$Gender, with, wtd_mean(Income, w_o_ht))
ds$mean_g_se <- capply(ds, ds$Gender, with, jk_wtd_mean_se(Income, State, w_o_ht))
# Define a round method for data frames so round will work on a data frame with non-numeric variables
round.data.frame <- function(x, digits = 0) as.data.frame(lapply(x, function(x) if(is.numeric(x)) round(x, digits = digits) else x))  
round(ds, 3)
dssum <- up(ds, ~ State/Gender) # keep all State x Gender invariant variables
round(dssum, 3)
#'
#' ## Function to combine population Ns with sample
#'
ht_est <- function(sam, pop, fmla){
        # assume
        strat.vars <- names(pop) %less% "N"
        sam <- merge(sam, pop, all = T) 
        sel.mf <- sam[strat.vars]
        strat.fmla <- formula(sel.mf)
        
        sam$n <- capply(1:nrow(sam), sam[strat.vars], length)
        sam$ht <- with(sam, N/n)
        ests <- model.frame(fmla, sam, na.action = na.include)
        n_ests <- length(ests)
        names_ests <- names(ests)
        ht <- sam$ht
        inds <- 1:nrow(sam)
        
        for(i in seq_len(n_ests)) {
                xx <- ests[[i]]
                xn <- names_ests[i]
                sam[[paste0(xn,'.est')]]  <- wtd_mean(xx,ht)
                sam[[paste0(xn,'.est.se')]]  <- jk_wtd_mean_se(xx,sel.mf,ht)
#                sam$x_est.g <- capply(sam, sel.mf, with, wtd_mean(x,ht))
                sam[[paste0(xn,'.est.g')]]  <- 
                        capply(inds, sel.mf, function(ii) wtd_mean(xx[ii],ht[ii]))
                sam[[paste0(xn,'.est.g.se')]]  <- 
                        capply(inds, sel.mf, function(ii) 
                                jk_wtd_mean_se(xx[ii],sel.mf[ii,],ht[ii]))
                
                # sam$x_est.g <- capply(sam, sel.mf, with, jk_wtd_mean_se(x,sel.mf,ht))
                
        }
        up(sam, sel.mf)
}
system.time(
zz<-ht_est(ds,dpop, ~Income)
)
        
        
        
        
}
formula(dpop) - N
names(dpop) %less% "N"
heathermkrause/WWC documentation built on May 17, 2019, 3:20 p.m.