set.seed(89890) opts_chunk$set(tidy=FALSE) library("plyr") library(predcomps) print.data.frame <- function(...) base::print.data.frame(..., row.names=FALSE)
In computing the APC, we assign weights to pairs of observations based on the Mahalanobis distance between the corresponding $v$'s. This note uses a toy example to argue that we must renormalize the weights so that when we group by the first element of each pair and sum the weights, each group has the same sum-of-weights.
This suggested renormalization is not discussed in Gelman and Pardoe 2007.
If $u$ is an input of interest and $v$ are the other inputs, recall that the APC is the expected value of a quantity formed upon sampling from $v$, sampling twice from $u$ conditional on $v$, and computing predictive comparisons using those $u$'s. See equation (2) of Gelman and Pardoe 2007.
If there were enough pairs of points with identical $v$, we could just use the sample distribution of $u$ given $v$. As noted in the paper, we may have few (if any) pairs of points with identical $v$. Even so, let's think through an example where we do have such identical pairs:
Suppose $v$ consists of only 1 input, which can take one of two values. For simplicity, assume $u$ has exactly two possible (equally likely) values at each $v$ (which are different depending on the value of $v$), so there is only one possible transition at each $v$. Here's an example:
exampleDF <- data.frame( v=c(3,3,7,7), u=c(10,20,12,22) )[rep(c(1,2,3,4),c(40,40,10,10)),]
# Count each u/v combination: kable(ddply(exampleDF, c("v","u"), function(df) data.frame(CountOfRows = nrow(df))))
Say we have a model $\hat{y} = f(u,v)$. I'll choose $\hat{y} = f(u,v) = uv$ for a simple example. (How the model is estimated is completely orthogonal to the questions addressed here, and I'll just pretend it's known exactly.)
Equation (2) in the paper says the numerator in the APC should be:
$$(.4)(.5)(.5)(f(20,3) - f(10, 3)) + (0.1)(.5)(.5)(f(22,7) - f(12,7)) $$
The .5's are the $p(u|v)$'s (and will cancel out in this case). (Terms with transition size 0 aren't included.)
The denominator is:
$$(.4)(.5)(.5)((20 - 10) + (.1)(.5)(.5)((22 - 12)$$
The ratio simplifies to:
$$.8 \delta_u(10 \rightarrow 20, 3, f) + 0.2 \delta_u(12 \rightarrow 22, 7, f)$$
This is all overkill for our very simple example, where it's easy to see that the APC is just $(.8)(3) + (.2)(6)$. But I wanted to be very concrete.
I'll compute it:
f <- function(u, v) return(u*v) ApcExact <- .8*(f(20,3) - f(10,3))/10 + .2*(f(22,7) - f(12,7))/10 ApcExact
Now imagine we don't have any exact duplicates of $v$. To get a corresponding example like that, I'll modify the first example by adding a really tiny bit of noise to $v$: $v_{new} = v + N(0,\epsilon)$.
exampleDF2 <- transform(exampleDF, v = v + rnorm(nrow(exampleDF), sd=.001))
Now we form pairs and compute weights as described in the paper. Here's a sample of the resulting data frame of pairs, just to get a sense of what it looks like:
pairsDF <- GetPairs(exampleDF2, u="u", v="v", renormalizeWeights=FALSE)
kable(pairsDF[sample(1:nrow(pairsDF), 12), ], row.names=FALSE)
Now pairs with nearby $v$'s (which would have been the same $v$'s previously) have high weights, where pairs from far-away $v$'s (which were different $v$'s in the previous example) have low weights. That's good.
But $v$ near 3 now has more weight in the data set for two reasons:
Reason (1) is good, but reason (2) is not so good.
In the data frame of pairs, the weights are all close to 0.14 or 1. Let's look at the joint distribution of $u$ and $v$ in just the pairs with weights close to 1:
pairsDF <- data.frame(vRounded = round(pairsDF$v), pairsDF) pairsHighWeightsDF <- subset(pairsDF, Weight > 0.9) ddply(pairsHighWeightsDF, c("vRounded","u"), function(df) data.frame(CountOfRows = nrow(df), ProportionOfRows = nrow(df)/nrow(pairsHighWeightsDF)))
We see that $v$'s near 7 makes up only about 5.7% of the pairs. (It would be exactly $(.2)(.2) = 4$%, except that when we form pairs to compute the APC we don't pair any row with itself.)
If we form the APC based on these pairs and these weights, we weight the $v$'s near 3 too much, so our APC is too low:
pairsDF$yHat1 <- f(pairsDF$u, pairsDF$v) pairsDF$yHat2 <- f(pairsDF$u.B, pairsDF$v) pairsDF$uDiff <- pairsDF$u.B - pairsDF$u ApcApprox1 <- with(pairsDF, sum(Weight * (yHat2 - yHat1) * sign(uDiff)) / sum(Weight * uDiff * sign(uDiff))) ApcApprox1
I showed the full computation above, but we can also use the GetAPC
function from this package:
GetSingleInputApcs(function(df) return(df$u * df$v), exampleDF2, u="u", v="v", renormalizeWeights=FALSE)$PerUnitInput.Signed
Instead, we can normalize Weights so that within each first element of the pair.
pairsDFWeightsNormalized <- ddply(pairsDF, "OriginalRowNumber", transform, Weight = Weight/sum(Weight)) ApcApprox2 <- with(pairsDFWeightsNormalized, sum(Weight * (yHat2 - yHat1) * sign(uDiff)) / sum(Weight * uDiff * sign(uDiff))) ApcApprox2
These renormalized Weights are the ones returned from GetPairs
by default, and used in GetAPC
by default:
GetSingleInputApcs(function(df) return(df$u * df$v), exampleDF2, u="u", v="v", renormalizeWeights=TRUE)$PerUnitInput.Signed
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.