In dchudz/predcomps: Average Predictive Comparisons

library("predcomps")
print.data.frame <- function(...) base::print.data.frame(..., row.names=FALSE)

This is a note about how the weights are (should be) constructed. The observation here is that our estimated APCs do not approach the theoretical APCs in the limit as we get more data. This is a clue that we should tweak the definition of the weights, but I haven't implemented such an improvement yet.

Large $N$ Limit

As we get more data, it would be nice if the APC we compute tends toward the right answer, equation (2) of Gelman and Pardoe 2007. I don't care about asymptotics as much as some people, but if we don't get the right answer in the limit, that's at least a clue that we might not be doing as well as we can for smaller $N$. This note shows that unless we adjust the weighting function as we get more data, we won't have that nice property.

# Build an N-row example: 80% of rows have v near 3, 20% have v near 7.
makeExampleDF <- function(N) {
  exampleDF <- data.frame(
    v=c(3,3,7,7),
    u=c(10,20,12,22)
    )[rep(c(1,2,3,4),c(.4*N,.4*N,.1*N,.1*N)),]
  # Add a small amount of noise to v.
  exampleDF <- transform(exampleDF, v = v + rnorm(nrow(exampleDF), sd=.001))
  return(exampleDF)
}

Just as in the note "Normalizing Weights", the APC should be:

$$0.8 \delta_u(10 \rightarrow 20, 3, f) + 0.2 \delta_u(12 \rightarrow 22, 7, f) = (0.8)(3) + (0.2)(7) = 3.8$$
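As a quick check of this arithmetic, here is a minimal sketch in base R. It assumes the prediction function $f(u, v) = uv$ used in the calls in this note; the names `delta_u` and `theoreticalAPC` are introduced only for illustration and are not part of predcomps.

```r
# f matches the prediction function used in this note: u * v.
f <- function(u, v) u * v

# Per-unit predictive comparison for moving u from u1 to u2 at fixed v.
# (Illustrative helper, not a predcomps function.)
delta_u <- function(u1, u2, v, f) (f(u2, v) - f(u1, v)) / (u2 - u1)

# 80% of the rows have v = 3 and 20% have v = 7 (see makeExampleDF).
theoreticalAPC <- 0.8 * delta_u(10, 20, 3, f) + 0.2 * delta_u(12, 22, 7, f)
print(theoreticalAPC)
```

Because $f$ is linear in $u$, each per-unit comparison is just the value of $v$ at which it is evaluated, so the result is the weighted average $0.8 \cdot 3 + 0.2 \cdot 7$.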

We get almost the same APC with 300 data points as with 100:

GetSingleInputApcs(function(df) return(df$u * df$v), makeExampleDF(100), u="u", v="v")$PerUnitInput.Signed
GetSingleInputApcs(function(df) return(df$u * df$v), makeExampleDF(300), u="u", v="v")$PerUnitInput.Signed

If we're looking at one value for $v$, $v=v_0$, the tradeoff in determining the weights is:

(1) $v$'s closer to $v_0$ will do a better job representing the distribution of $u$ conditional on $v=v_0$,
(2) but if too few $v$'s get too much of the weight, our estimate for the conditional distribution of $u$ will be too noisy.

The reason our estimate didn't improve with more data is that we're not moving more weight to nearby points as we get more data. For any $N$, we're presently putting roughly the same amount of mass at the same distances. As we get more data, we can afford to put more weight closer to $v_0$, because the noise concern, item (2) in the tradeoff above, becomes less of a problem. A couple of ideas:

With the weights as $\frac{1}{k+d}$ ($d$ is the Mahalanobis distance), we could scale $k$ down as $N$ goes up.
Or we could keep the current weights ($k=1$) but drop all but the closest $s(N)$ points to each $v$ (equivalent to setting their weights to 0). The function $s$ needs to increase with $N$, but more slowly than $N$; for example, $s(N) = \sqrt{N}$ would probably work. This way, as $N$ increases we always decrease bias (sampling from closer to the right $v$) and also decrease variance (more samples). It would also help keep run times and memory usage under control as $N$ increases.
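The two ideas above can be sketched in base R. This is only an illustration under stated assumptions, not how predcomps computes its weights: the helper names (`farShare`, `inverseDistanceWeights`, `nearestWeights`, `compareSchemes`) are hypothetical, the schedule $k = 1/\sqrt{N}$ is just one possible choice, and the distance is the one-dimensional Mahalanobis distance $|v - v_0| / \mathrm{sd}(v)$ on a jitter-free version of the example data. For a reference point at $v_0 = 3$, it reports the share of total weight that lands on the wrong cluster ($v = 7$).

```r
# Hypothetical helpers for illustration; none of these are part of predcomps.

# Share of the total weight that falls on points flagged as far from v0.
farShare <- function(w, isFar) sum(w[isFar]) / sum(w)

# Idea 1: weights 1 / (k + d), with k allowed to depend on N.
inverseDistanceWeights <- function(d, k = 1) 1 / (k + d)

# Idea 2: keep the k = 1 weights but zero out all but the s closest points.
nearestWeights <- function(d, s) {
  w <- 1 / (1 + d)
  w[rank(d, ties.method = "first") > s] <- 0
  w
}

compareSchemes <- function(N) {
  # Same 80/20 split of v as makeExampleDF, without the jitter.
  v <- rep(c(3, 7), times = round(c(0.8, 0.2) * N))
  v0 <- 3
  # One-dimensional Mahalanobis distance: |v - v0| / sd(v).
  d <- abs(v - v0) / sd(v)
  isFar <- v != v0
  c(N = N,
    current = farShare(inverseDistanceWeights(d, k = 1), isFar),
    # Assumed schedule for k; any k that shrinks with N would do here.
    shrinkK = farShare(inverseDistanceWeights(d, k = 1 / sqrt(N)), isFar),
    nearest = farShare(nearestWeights(d, s = ceiling(sqrt(N))), isFar))
}

results <- t(sapply(c(100, 1000, 10000), compareSchemes))
print(round(results, 4))
```

In this toy setup the `current` column stays roughly constant as $N$ grows, which is the behavior described above, while the `shrinkK` column decreases with $N$ and the `nearest` column is zero because the $\sqrt{N}$ closest points all sit in the $v = 3$ cluster.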

dchudz/predcomps documentation built on May 15, 2019, 1:48 a.m.
