CrossValidation for choosing the rank of an SVD approximation.
Description
Perform Wold or Gabrielstyle crossvalidation for determining the appropriate rank SVD approximation of a matrix.
Usage
1 2 3 4  cv.svd.gabriel(x, krow = 2, kcol = 2,
maxrank = floor(min(n  n/krow, p  p/kcol)))
cv.svd.wold(x, k = 5, maxrank = 20, tol = 1e4, maxiter = 20)

Arguments
x 
the matrix to crossvalidate. 
k 
the number of folds (for Woldstyle CV). 
krow 
the number of row folds (for Gabrielstyle CV). 
kcol 
the number of column folds (for Gabrielstyle CV). 
maxrank 
the maximum rank to crossvalidate up to. 
tol 
the convergence tolerance for 
maxiter 
the maximum number of iterations for

Details
These functions are for crossvalidating the SVD of a matrix. They
assume a model $X = U D V' + E$ with the terms being signal
and noise, and try to find the best rank to truncate the SVD of x
at for minimizing prediction error. Here, prediction error is measured
as sum of squares of residuals between the truncated SVD and the
signal part.
For both types of crossvalidation, in each replicate we leave out part of the matrix, fit an SVD approximation to the leftin part, and measure prediction error on the leftout part.
In Woldstyle crossvalidation, the holdout set is "speckled", a random set
of elements in the matrix. The missing elements are predicted using
impute.svd
.
In Gabrielstyle crossvalidation, the holdout set is "blocked". We
permute the rows and columns of the matrix, and leave out the lowerright
block. We use a modified Schurcomplement to predict the heldout block.
In Gabrielstyle, there are krow*kcol
total folds.
Value
call 
the function call 
msep 
the mean square error of prediction (MSEP); this is a matrix
whose columns contain the mean square errors in the predictions of the
holdout sets for ranks 0, 1, ..., 
maxrank 
the maximum rank for which prediction error is estimated;
this is equal to 
krow 
the number of row folds (for Gabrielstyle only). 
kcol 
the number of column folds (for Gabrielstyle only). 
rowsets 
the partition of rows into 
colsets 
the partition of the columns into 
k 
the number of folds (for Woldstyle only). 
sets 
the partition of indices into 
Note
Gabriel's version of crossvalidation was for leaving out a single element of the matrix, which corresponds to nbypfold. Owen and Perry generalized Gabriel's idea to larger holdouts, showing that 2by2fold crossvalidation often works better.
Wold's original version of crossvalidation did not use the EM algorithm to estimate the SVD. He recommend using the NIPALS algorithm instead, which has since faded into obscurity.
Woldstyle crossvalidation takes a lot more computation than Gabrielstyle.
The maxrank
, tol
, and maxiter
have been chosen to give
up some accuracy in the name of expediency. They may need to be adjusted to
get the best results.
Author(s)
Patrick O. Perry
References
Gabriel, K.R. (2002). Le biplot–outil d'explaration de données multidimensionelles. J. Roy. Stat. Soc. Series B 40 186–196.
Owen, A.B. and Perry, P.O. (2009). Bicrossvalidation of the SVD and the nonnegative matrix factorization. Annals of Applied Statistics 3(2) 564–594.
Wold, S. (1978). Crossvaliditory estimation of the number of components in factor and principal components models. Technometrics 20(4) 397–405.
See Also
impute.svd
,
plot.cvsvd
,
print.cvsvd
summary.cvsvd
Examples
1 2 3 4 5 6 7 8 9 10 11 12  # generate a rank2 matrix plus noise
n < 50; p < 20; k < 2
u < matrix( rnorm( n*k ), n, k )
v < matrix( rnorm( p*k ), p, k )
e < matrix( rnorm( n*p ), n, p )
x < u %*% t(v) + e
# perform 5fold Woldstyle crossvalidtion
(cvw < cv.svd.wold( x, 5, maxrank=10 ))
# perform (2,2)fold Gabrielstyle crossvalidation
(cvg < cv.svd.gabriel( x, 2, 2, maxrank=10 ))
