# bootSVD: Calculates bootstrap distribution of PCA (i.e. SVD) results In bootSVD: Fast, Exact Bootstrap Principal Component Analysis for High Dimensional Data

## Description

Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). Dimension of the sample is denoted by p, and sample size is denoted by n, with p>n.

## Usage

 ```1 2 3 4``` ```bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50, output = "HD_moments", verbose = getOption("verbose"), bInds = NULL, percentiles = c(0.025, 0.975), centerSamples = TRUE, pattern_V = "V_", pattern_Vb = "Vb_") ```

## Arguments

 `Y` initial data sample, which can be either a matrix or a `ff` matrix. `Y` can be either tall (p by n) or wide (n by p). If `Y` is entered and `V`, `d` and `U` (see definitions below) are not entered, then `bootSVD` will also compute the SVD of `Y`. In this case where the SVD is computed, `bootSVD` will assume that the larger dimension of `Y` is p, and the smaller dimension of code Y is n (i.e. `bootSVD` assumes that (p>n). This assumption can be overriden by manually entering `V`, `U` and `d`. For cases where the entire data matrix can be easily stored in memory (e.g. p<50000), it is generally appropriate to enter `Y` as a standard matrix. When `Y` is large enough that matrix algebra on `Y` is too demanding for memory though, `Y` should be entered as a `ff` object, where the actual data is stored on disk. If `Y` has class `ff`, and `V`, `d` or `U` is not entered, then block matrix algebra will be used to calculate the PCs and bootstrap PCs. The results of these calculations will be returned as `ff` objects as well. `K` number of PCs to calculate the bootstrap distribution for. `V` (optional) the (p by n) full matrix of p-dimensional PCs for the sample data matrix. If `Y` is wide, these are the right singular vectors of `Y` (i.e. Y=UDV'). If `Y` is tall, these are the left singular vectors of `Y` (i.e. Y=VDU'). In general it is assumed that p>n, however, this can be overridden by setting `V` and `U` appropriately. Like `Y`, the argument `V` can be either a standard matrix or a `ff` matrix. If `V` is a `ff` object, the bootstrap PCs, if requested, will be returned as `ff` objects as well. `d` (optional) n-length vector of the singular values of `Y`. For example, if `Y` is tall, then we have Y=VDU' with `D=diag(d)`. `U` (optional) the (n by n) full set of n-dimensional singular vectors of `Y`. If `Y` is wide, these are the left singular vectors of `Y` (i.e. Y=UDV'). If `Y` is tall, these are the right singular vectors of `Y` (i.e. Y=VDU'). `B` number of bootstrap samples to compute. `output` a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD', 'HD_moments', 'full_HD_PC_dist', and 'HD_percentiles'. See below for explanations of these outputs. For especially high dimensional cases, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations. `verbose` if TRUE, the function will print progress during calculation procedure. `bInds` a (B by n) matrix of bootstrap indeces, where `B` is the number of bootstrap samples, and `n` is the sample size. The purpose of setting a specific bootstrap sampling index is to allow the results to be more precisely compared against standard bootstrap PCA calculations. If entered, the `bInds` argument will override the `B` argument. `percentiles` a vector containing percentiles to be used to calculate element-wise percentiles across the bootstrap distribution (both across the distribution of p-dimensional components and the distribution of n-dimensional components). For example, `percentiles=c(.025,.975)` will return the 2.5 and 97.5 percentiles, which can be used as 95 percent bootstrap percentile CIs. Alternatively, a longer vector of percentiles can be entered. `centerSamples` whether each bootstrap sample should be centered before calculating the SVD. `pattern_V` if `Y` is a class `ff` object, then the returned PCs of `Y` will also be a class `ff` object. `pattern_V` is passed to `ff` in creation of the `initial_SVD` output. Specifically, `pattern_V` is a filename prefix used for storing the high dimensional PCs of the original sample. `pattern_Vb` if `Y` or `V` is a class `ff` object, then the returned bootstrap PCs will also be class `ff` objects. `pattern_Vb` is passed to `ff` in creation of the `full_HD_PC_dist` output. Specifically, `pattern_Vb` is a filename prefix used for storing the high dimensional bootstrap PCs.

## Details

Users might also consider changing the global options `ffbatchbytes`, from the `ff` package, and `mc.cores`, from the `parallel` package. When `ff` objects are entered as arguments for `bootSVD`, the required matrix algebra is done using block matrix alebra. The `ffbatchbytes` option determines the size of the largest block matrix that will be held in memory at any one time. The `mc.cores` option (set to 1 by default) determines the level of parallelization to use when calculating the high dimensional distribution of the bootstrap PCs (see `mclapply`).

## Value

`bootSVD` returns a list that can include any of the following elements, depending on what is specified in the `output` argument:

initial_SVD

The singular value decomposition of the centered, original data matrix. `initial_SVD` is a list containing `V`, the matrix of p-dimensional principal components, `d`, the vector of singular values of `Y`, and `U`, the matrix of n-dimensional singular vectors of `Y`.

HD_moments

A list containing the bootstrap expected value (`EPCs`), element-wise bootstrap variance (`varPCs`), and element-wise bootstrap standard deviation (`sdPCs`) for each of the p-dimensional PCs. Each of these three elements of `HD_moments` is also a list, which contains K vectors, one for each PC. `HD_moments` also contains `momentCI`, a K-length list of (p by 2) matrices containing element-wise moment based confidence intervals for the PCs.

full_HD_PC_dist

A B-length list of matrices (or `ff` matrices), with the b^{th} list element equal to the (p by K) matrix of high dimensional PCs for the b^{th} bootstrap sample.
For especially high dimensional cases when the output is returned as `ff` matrices, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations.
To reindex these PCs by k (the PC index) as opposed to b (the bootstrap index), see `reindexMatricesByK`. Again though, caution shoulded be used when reindexing PCs stored as `ff` objects, as this will double the number of files stored.

HD_percentiles

A list of K matrices, each of dimension (p by q), where q is the number of percentiles requested (i.e. q = `length(percentiles)`). The k^{th} matrix in `HD_percentiles` contains element-wise percentiles for the k^{th}, p-dimensional PC.

In addition, the following results are always included in the output, regardless of what is specified in the `output` argument:

 `full_LD_PC_dist` A B-length list of matrices, with the b^{th} list element equal to the (p by K) matrix of PCs of the scores in the b^{th} bootstrap sample. To reindex these vectors by k (the PC index), see `reindexMatricesByK`. `d_dist` A `B`-length list of vectors, with the b^{th} element of `d_dist` containing the n-length vector of singular values from the b^{th} bootstrap sample. To reindex these values by k (the PC index), see `reindexVectorsByK`. `U_dist` A `B`-length list of (n by K) matrices, with the columns of the b^{th} matrix containing the n-length singular vectors from the b^{th} bootstrap sample. To reindex these vectors by k (the PC index), see `reindexMatricesByK`. `LD_moments` A list that is comparable to `HD_moments`, but that instead describes the variability of the n-dimensional principal components of the resampled score matrices. `LD_moments` contains the bootstrap expected value (`EPCs`), element-wise bootstrap variances (`varPCs`), and element-wise bootstrap standard deviations (`sdPCs`) for each of the n-dimensional PCs. Each of these three elements of `LD_moments` is also a list, which contains K vectors, one for each PC. `LD_moments` also contains `momentCI`, a list of K (n by 2) matrices containing element-wise, moment-based confidence intervals for the PCs. `LD_percentiles` A list of K matrices, each of dimension (p by q), where q is the number of percentiles requested (i.e. q = `length(percentiles)`). The k^{th} matrix in `LD_percentiles` contains element-wise percentiles for the k^{th} n-dimensional PC.

## References

Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102``` ```#use small n, small B, for a quick illustration set.seed(0) Y<-simEEG(n=100, centered=TRUE, wide=TRUE) b<-bootSVD(Y, B=50, K=2, output= c('initial_SVD', 'HD_moments', 'full_HD_PC_dist', 'HD_percentiles'), verbose=interactive()) b #explore results matplot(b\$initial_SVD\$V[,1:4],type='l',main='Fitted PCs',lty=1) legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2) ###################### # look specifically at 2nd PC k<-2 ###### #looking at HD variability #plot several draws from bootstrap distribution VsByK<-reindexMatricesByK(b\$full_HD_PC_dist) matplot(t(VsByK[[k]][1:20,]),type='l',lty=1, main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k)) #plot pointwise CIs matplot(b\$HD_moments\$momentCI[[k]],type='l',col='blue',lty=1, main=paste0('CIs For HD PC ',k)) matlines(b\$HD_percentiles[[k]],type='l',col='darkgreen',lty=1) lines(b\$initial_SVD\$V[,k]) legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'), lty=1,col=c('black','blue','darkgreen')) abline(h=0,lty=2,col='darkgrey') ###### # looking at LD variability # plot several draws from bootstrap distribution AsByK<-reindexMatricesByK(b\$full_LD_PC_dist) matplot(t(AsByK[[k]][1:50,]),type='l',lty=1, main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k), xlim=c(1,10),xlab='PC index (truncated)') # plot pointwise CIs matplot(b\$LD_moments\$momentCI[[k]],type='o',col='blue', lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10), xlab='PC index (truncated)',pch=1) matlines(b\$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1) abline(h=0,lty=2,col='darkgrey') legend('topright',c('Moment CIs','Percentile CIs'),lty=1, pch=1,col=c('blue','darkgreen')) #Note: variability is mostly due to rotations with the third and fourth PC. # Bootstrap eigenvalue distribution dsByK<-reindexVectorsByK(b\$d_dist) boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k), ylab='Bootstrap Distribution', ylim=range(c(dsByK[[k]]^2,b\$initial_SVD\$d[k]^2))) points(b\$initial_SVD\$d[k]^2,pch=18,col='red') legend('bottomright','Sample Value',pch=18,col='red') ################## #Example with ff input library(ff) Yff<-as.ff(Y, pattern='Y_') # If desired, change options in 'ff' package to # adjust the size of matrix blocks held in RAM. # For example: # options('ffbatchbytes'=100000) ff_dir<-tempdir() pattern_V <- paste0(ff_dir,'/V_') pattern_Vb <- paste0(ff_dir,'/Vb_') bff <- bootSVD(Yff, B=50, K=2, output=c('initial_SVD', 'HD_moments', 'full_HD_PC_dist', 'HD_percentiles'), pattern_V= pattern_V, pattern_Vb=pattern_Vb, verbose=interactive()) # Note that elements of full_HD_PC_dist and initial_SVD # have class 'ff' str(lapply(bff,function(x) class(x[]))) #Show some results of bootstrap draws plot(bff\$full_HD_PC_dist[][,k],type='l') #Reindexing by K will create a new set of ff files. VsByKff<-reindexMatricesByK(bff\$full_HD_PC_dist, pattern=paste0(ff_dir,'/Vk_')) physical(bff\$full_HD_PC_dist[])\$filename physical(VsByKff[])\$filename matplot(t(VsByKff[[k]][1:10,]),type='l',lty=1, main=paste0('Bootstrap Distribution of PC',k)) # Saving and moving results: saveRDS(bff,file=paste0(ff_dir,'/bff.rds')) close(bff\$initial_SVD\$V) physical(bff\$initial_SVD\$V)\$filename # If the 'ff' files on disk are moved or renamed, # this filename attribute can be changed: old_ff_path <- physical(bff\$initial_SVD\$V)\$filename new_ff_path <- paste0(tempdir(),'/new_V_file.ff') file.rename(from= old_ff_path, to= new_ff_path) physical(bff\$initial_SVD\$V)\$filename <- new_ff_path matplot(bff\$initial_SVD\$V[,1:4],type='l',lty=1) ```

bootSVD documentation built on Feb. 2, 2021, 5:06 p.m.