set.seed(0)

Cross-validation is often used in machine learning to judge how well a model fits. Instead of using the entire data set to fit the model, it uses one part of the data set to fit the model and then tests the model on the remaining data. This gives an idea of how well the model will generalize to independent data.

Leave-one-out predictions using Gaussian processes

Leave-one-out prediction fits a model to all of the data except a single point, and then makes a prediction at that point, which can be compared to the actual value. This may seem very expensive to do, but it is actually an inexpensive computation for a Gaussian process model, as long as the parameters estimated from the full model are reused. Keeping the full-model parameters will bias the predictions toward better results than if the parameters were re-estimated for each left-out point.

Normally each prediction point requires solving a matrix equation. To predict the output $y$ at a point $\mathbf{x}$, given input data in matrix $X_2$ and output $\mathbf{y_2}$, we use the equation $$ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n}) $$ For leave-one-out predictions, the matrix $X_2$ contains all the design points except the one we are predicting at, and thus is different for each prediction. However, we already have the correlation matrix $R$ for the full data set from estimating the parameters, and there is a shortcut to find the inverse of a matrix after leaving out a single row and column.
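
To make this concrete, here is a minimal sketch of a naive leave-one-out prediction at a single design point, solving the reduced system directly. The Gaussian correlation function, the toy response, and the use of the sample mean as a stand-in for $\hat{\mu}$ are all assumptions for illustration, not GauPro's actual estimates.

# Naive leave-one-out prediction at design point i: drop row/column i
# and solve the reduced system directly (illustration only)
corr <- function(x, y) {exp(sum(-30*(x-y)^2))}   # assumed Gaussian correlation
n <- 30
d <- 2
X <- matrix(runif(n*d), ncol=d)
y <- sin(2*pi*X[,1]) + X[,2]^2                   # made-up response
R <- outer(1:n, 1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))
mu_hat <- mean(y)                                # stand-in for the estimated mean
i <- 7                                           # point to leave out
r_i <- sapply((1:n)[-i], function(j) {corr(X[i,], X[j,])})  # R(x_i, X_{-i})
y_hat_i <- mu_hat + r_i %*% solve(R[-i,-i], y[-i] - mu_hat)
c(predicted = y_hat_i, actual = y[i])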

There is a significant speed-up from using a matrix multiplication instead of a matrix solve. The code chunk below shows that solving a linear system with a 200 x 200 matrix is over 30 times slower than a matrix-vector multiplication.

n <- 200
m1 <- matrix(runif(n*n), ncol=n)   # random 200 x 200 matrix
b1 <- runif(n)                     # random vector: right-hand side / multiplicand
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(m1, b1), m1 %*% b1)
}

Getting the inverse of a submatrix

Suppose we have a matrix $K$ and know its inverse $K^{-1}$. Suppose that $K$ has block structure $$ K = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$ Now we want to find $A^{-1}$ using $K^{-1}$ instead of doing the full inverse. We can write $K^{-1}$ in block structure $$K^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

Now we use the fact that $K K^{-1} = I$ $$ \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} E & F \\ G & H \end{bmatrix} $$

This gives the equations $$ AE + BG = I$$ $$ AF + BH = 0$$ $$ CE + DG = 0$$ $$ CF + DH = I$$

Solving the first equation for $A$, we have $AE = I - BG$, so (assuming $E$ is invertible) $$ A = (I - BG)E^{-1}$$ or $$ A^{-1} = E (I - BG)^{-1}$$
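
As a quick numeric check of this identity, here is a small sketch using an arbitrary well-conditioned random matrix (the shift by $5I$ is just to avoid a near-singular example).

# Verify A^{-1} = E (I - B G)^{-1} on a random partitioned matrix
nk <- 6
p <- 4                                           # A is the leading p x p block of K
K <- matrix(rnorm(nk*nk), nk, nk) + 5*diag(nk)   # keep K well-conditioned
Kinv <- solve(K)
A <- K[1:p, 1:p]
B <- K[1:p, (p+1):nk]
E <- Kinv[1:p, 1:p]
G <- Kinv[(p+1):nk, 1:p]
max(abs(solve(A) - E %*% solve(diag(p) - B %*% G)))   # should be near zero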

Leave-one-out covariance matrix inverse for Gaussian processes

For Gaussian processes we can consider the block structure of the covariance (or correlation) matrix when a single row and its corresponding column are removed. Let the first $n-1$ rows and columns be the covariance of the points in the design matrix $X$, while the last row and column hold the covariance of the vector $\mathbf{x}$ with $X$ and with itself. Then we have

$$ K = \begin{bmatrix} C(X,X) & C(X,\mathbf{x}) \\ C(\mathbf{x},X) & C(\mathbf{x},\mathbf{x}) \end{bmatrix}$$

Using the notation from the previous subsection we have $A = C(X,X)$ and $B=C(X,\mathbf{x})$, and $E$ and $G$ are submatrices of the full $K^{-1}$. $B$ is a column vector, so I'll write it as a vector $\mathbf{b}$, and $G$ is a row vector, so I'll write it as $\mathbf{g}^T$. So we have $$ A^{-1} = C(X,X)^{-1} = E (I - \mathbf{b}\mathbf{g}^T)^{-1}$$

This still requires inverting $I-\mathbf{b}\mathbf{g}^T$, which is a large matrix. However, it can be inverted efficiently because $\mathbf{b}\mathbf{g}^T$ is a rank-one matrix, using the Sherman-Morrison formula. $$ (I - \mathbf{b}\mathbf{g}^T)^{-1} = I^{-1} + \frac{I^{-1}\mathbf{b}\mathbf{g}^TI^{-1}}{1-\mathbf{g}^TI^{-1}\mathbf{b}} = I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}} $$ Thus we have a shortcut for $A^{-1}$ that requires only multiplication $$ A^{-1} = E \left(I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}\right)$$
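
As a quick sanity check of this rank-one inverse formula, here is a small sketch using arbitrary random vectors (purely illustrative).

# Check (I - b g^T)^{-1} = I + b g^T / (1 - g^T b) for random vectors
bb <- rnorm(5)
gg <- rnorm(5)
lhs <- solve(diag(5) - bb %*% t(gg))
rhs <- diag(5) + bb %*% t(gg) / (1 - sum(gg*bb))
max(abs(lhs - rhs))   # should be near machine precision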

To speed up the computation of $A^{-1}$ further, it should be calculated as

$$ A^{-1} = E + \frac{(E\mathbf{b})\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}$$ so that only a matrix-vector product and an outer product are needed, rather than a matrix-matrix product. The code below demonstrates that this shortcut is almost twenty times faster than inverting the submatrix directly.

set.seed(0)
corr <- function(x,y) {exp(sum(-30*(x-y)^2))}   # Gaussian correlation function
n <- 200
d <- 2
X <- matrix(runif(n*d), ncol=d)
R <- outer(1:n, 1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))  # full correlation matrix
Rinv <- solve(R)
A <- R[-n,-n]          # correlation matrix with the last point left out
Ainv <- solve(A)       # direct inverse, for comparison
E <- Rinv[-n, -n]      # submatrix of the full inverse
b <- R[n,-n]           # correlation of the left-out point with the rest
g <- Rinv[n,-n]        # last row of the full inverse, without its last entry
Ainv_shortcut <- E + E %*% b %*% g / (1-sum(g*b))
summary(c(Ainv - Ainv_shortcut))   # differences should be tiny
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(A), E + E %*% b %*% g / (1-sum(g*b)))
}

In terms of the correlation matrices already calculated, this is the following, where $M_{-i}$ is the matrix $M$ with the $i$th row and column removed, $M_{i,-i}$ is the $i$th row of $M$ with its $i$th entry removed, and $M_{-i,i}$ is the $i$th column of $M$ with its $i$th entry removed.

$$ R(X_{-i})^{-1} = (R(X)^{-1})_{-i} + \frac{\left( (R(X)^{-1})_{-i}\, R(X)_{-i,i} \right) (R(X)^{-1})_{i,-i} }{1 - (R(X)^{-1})_{i,-i}\, R(X)_{-i,i}}$$
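
For example, the update can be wrapped in a small helper and checked against a direct inverse for an arbitrary $i$, reusing R and Rinv from the chunk above; the function name loo_inverse and its interface are my own for illustration, not part of GauPro.

# Leave-one-out inverse for any i, using only the full R and its inverse
loo_inverse <- function(R, Rinv, i) {
  E <- Rinv[-i, -i]               # submatrix of the full inverse
  b <- R[-i, i]                   # i-th column of R without its i-th entry
  g <- Rinv[i, -i]                # i-th row of the inverse without its i-th entry
  E + (E %*% b) %*% g / (1 - sum(g*b))
}
i <- 17
max(abs(loo_inverse(R, Rinv, i) - solve(R[-i,-i])))   # should be near zero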

Leave-one-out prediction

Recall that the predicted mean at a new point is $$ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n}) $$

For the leave-one-out prediction at design point $\mathbf{x_i}$, this becomes $$ \hat{y_i} = \hat{\mu} + R(\mathbf{x_i},~X_{-i}) R(X_{-i})^{-1}( \mathbf{y_{-i}} - \hat{\mu}\mathbf{1_{n-1}}) $$ where $R(X_{-i})^{-1}$ can be computed with the shortcut above.
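
Putting the pieces together, here is a sketch that computes all $n$ leave-one-out predictions with the shortcut, reusing X, R, and Rinv from the earlier chunk. The response y and the use of the sample mean for $\hat{\mu}$ are made up for illustration; in practice the mean and correlation parameters would come from the model fit to the full data.

# All n leave-one-out predictions using the rank-one shortcut,
# reusing X, R, and Rinv from the earlier chunk; y is a made-up response
y <- sin(2*pi*X[,1]) * cos(2*pi*X[,2])
mu_hat <- mean(y)                       # stand-in for the estimated mean
loo_pred <- sapply(1:n, function(i) {
  E <- Rinv[-i, -i]
  b <- R[-i, i]
  g <- Rinv[i, -i]
  Ainv <- E + (E %*% b) %*% g / (1 - sum(g*b))      # R(X_{-i})^{-1}
  mu_hat + R[i, -i] %*% Ainv %*% (y[-i] - mu_hat)   # prediction at x_i
})
head(cbind(actual = y, loo_prediction = loo_pred))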


