library("predcomps")
print.data.frame <- function(...) base::print.data.frame(..., row.names=FALSE)

This note is about how the weights are (and should be) constructed. The key observation is that our estimated APCs do not approach the theoretical APCs as we get more data. This is a clue that the definition of the weights should be tweaked, but I haven't implemented such an improvement yet.

Large $N$ Limit

As we get more data, it would be nice if the APC we compute tends toward the right answer, equation (2) of Gelman and Pardoe 2007. I don't care about asymptotics as much as some people, but if we don't get the right answer in the limit, that's at least a clue that we might not be doing as well as we can for smaller $N$. This note shows that unless we adjust the weighting function as we get more data, we won't have that nice property.

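makeExampleDF <- function(N) {
  # Four (v, u) rows, repeated so that 80% of the data has v = 3 and 20% has v = 7.
  # round() guards against a fractional count being truncated by rep().
  exampleDF <- data.frame(
    v=c(3,3,7,7),
    u=c(10,20,12,22)
    )[rep(c(1,2,3,4),round(c(.4*N,.4*N,.1*N,.1*N))),]
  # Add a small amount of Gaussian noise to v
  exampleDF <- transform(exampleDF, v = v + rnorm(nrow(exampleDF), sd=.001))
  return(exampleDF)
}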

Just as in the note "Normalizing Weights", the APC should be:

$$0.8 \, \delta_u(10 \rightarrow 20, 3, f) + 0.2 \, \delta_u(12 \rightarrow 22, 7, f) = (0.8)(3) + (0.2)(7) = 3.8$$
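As a quick check of this arithmetic, here is a minimal sketch in base R. It assumes the prediction function $f(u, v) = uv$ used in the calls in this note; the helper names `f`, `delta_u`, and `theoreticalApc` are made up for illustration and are not part of predcomps.

```r
# Prediction function used in this note: f(u, v) = u * v
f <- function(u, v) u * v

# Per-unit predictive comparison for a transition u1 -> u2 at a fixed v
# (hypothetical helper, not a predcomps function)
delta_u <- function(u1, u2, v, f) (f(u2, v) - f(u1, v)) / (u2 - u1)

d3 <- delta_u(10, 20, 3, f)  # transition available at v = 3
d7 <- delta_u(12, 22, 7, f)  # transition available at v = 7

# Weight each comparison by the share of the data in its v group
theoreticalApc <- 0.8 * d3 + 0.2 * d7
print(c(d3 = d3, d7 = d7, apc = theoreticalApc))
```

Because $f$ is linear in $u$, each comparison is just the value of $v$ at which it is taken, so the weighted average is $0.8 \cdot 3 + 0.2 \cdot 7$.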

We get almost the same estimated APC with 300 data points as with 100:

GetSingleInputApcs(function(df) return(df$u * df$v), makeExampleDF(100), u="u", v="v")$PerUnitInput.Signed
GetSingleInputApcs(function(df) return(df$u * df$v), makeExampleDF(300), u="u", v="v")$PerUnitInput.Signed

If we're looking at one value for $v$, $v=v_0$, the tradeoff in determining the weights is:

  1. $v$'s closer to $v_0$ will do a better job representing the distribution of $u$ conditional on $v=v_0$
  2. but if too few $v$'s get too much of the weight, our estimate for the conditional distribution of $u$ will be too noisy
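To make the first side of this tradeoff concrete, here is a rough one-dimensional sketch. Everything in it is an assumption for illustration: the weight function `weightNear`, its scale parameter `h`, and the rate `N^(-1/5)` are hypothetical and are not taken from predcomps. It measures what share of the total weight lands on points close to $v_0 = 7$, once with a scale that ignores $N$ and once with a scale that shrinks as $N$ grows.

```r
# Hypothetical weight function (illustration only): weight decays with the
# squared distance from v0, measured in units of a scale parameter h.
weightNear <- function(v, v0, h = 1) 1 / (1 + ((v - v0) / h)^2)

# Share of the total weight that lands on points within `radius` of v0
shareNear <- function(v, v0, h, radius = 0.5) {
  w <- weightNear(v, v0, h)
  sum(w[abs(v - v0) < radius]) / sum(w)
}

# Same v distribution as the example data: 80% of points near 3, 20% near 7
makeV <- function(N) {
  rep(c(3, 7), round(c(0.8, 0.2) * N)) + rnorm(N, sd = 0.001)
}

set.seed(1)
sizes <- c(100, 300, 1000)
shares <- data.frame(
  N = sizes,
  # scale does not depend on N
  fixedH = sapply(sizes, function(N) shareNear(makeV(N), v0 = 7, h = 1)),
  # hypothetical choice: scale shrinks as N grows
  shrinkingH = sapply(sizes, function(N) shareNear(makeV(N), v0 = 7, h = N^(-1/5)))
)
print(shares)
```

With the fixed scale, the share of weight on nearby points should be essentially the same for every $N$, while the shrinking scale moves that share toward 1. The sketch says nothing about the second side of the tradeoff: in this toy data every cluster has plenty of points, so concentrating the weight costs little in noise.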

The reason our estimate didn't improve with more data is that we're not moving more weight to nearby points as we get more data. For any $N$, we're presently putting roughly the same amount of mass at the same distances. As we get more data, we can afford to put more weight closer to $v_0$, because the noise concern (item 2 above) becomes less of a problem. A couple of ideas are:



dchudz/predcomps documentation built on May 15, 2019, 1:48 a.m.