Selection of a representative variable subset


View source: R/bigpca.R


Returns a subset (size='keep') of row or column numbers that are most representative of a dataset. This function performs PCA on a small subset of columns and all rows (when rows=TRUE, or vice versa when rows=FALSE), and selects the rows (when rows=TRUE) most correlated with the first 'n' principal components, where 'n' is chosen by the function quick.elbow(). The number of variables selected for each component is weighted by the proportion of variance that component explains.
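The selection idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the bigpca implementation: the helper name select_rows_by_pc and its n_pc argument are hypothetical, the number of components is fixed by hand rather than chosen by quick.elbow(), and all columns are used for the PCA because the toy matrix is small.

```r
# Illustrative sketch only: base R, not the bigpca implementation.
set.seed(1)
mat <- matrix(rnorm(500 * 40), nrow = 500)   # 500 rows (variables), 40 columns

# Hypothetical helper: pick the rows most correlated with the first n_pc components.
select_rows_by_pc <- function(m, keep = 0.05, n_pc = 3) {
  # 'keep' as a proportion, or as an absolute size when greater than 2
  n_keep <- if (keep > 2) as.integer(keep) else ceiling(keep * nrow(m))
  # Transpose so that the rows of m are treated as the variables of the PCA
  pca <- prcomp(t(m), center = TRUE, scale. = FALSE)
  var_expl <- pca$sdev[seq_len(n_pc)]^2
  weights <- var_expl / sum(var_expl)
  # Number of rows taken per component, proportional to variance explained
  quota <- pmax(1, round(weights * n_keep))
  # Absolute correlation of every row of m with each component's scores
  cors <- abs(cor(t(m), pca$x[, seq_len(n_pc), drop = FALSE]))
  chosen <- integer(0)
  for (j in seq_len(n_pc)) {
    ranked <- order(cors[, j], decreasing = TRUE)
    ranked <- setdiff(ranked, chosen)          # skip rows already selected
    chosen <- c(chosen, head(ranked, quota[j]))
  }
  sort(head(chosen, n_keep))
}

idx <- select_rows_by_pc(mat, keep = 0.05)
```

Because the per-component quotas are rounded, this sketch can return one row fewer than requested; a fuller version would top up the selection from the first component.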


2, keep = 0.05, rows = TRUE, dir = getwd(),
  random = TRUE, = 0.1, ...)



a big.matrix, matrix or any object accepted by get.big.matrix()


numeric, by default a proportion (decimal) of the original number of rows/columns to choose for the subset. If instead an integer > 2 is given, it is taken as the size of the desired subset; e.g., for a dataset with 10,000 rows where you want a subset of 1,000, you could set 'keep' to either 0.1 or 1000.


logical, whether the subset should be of the rows of bigMat. If rows=FALSE, the subset is chosen from the columns; this is equivalent to selecting rows of the transpose of bigMat, but avoids actually performing the transpose, which can save time for large matrices.


the directory containing the bigMat backing file (e.g., parameter for get.big.matrix()).


logical, passed to, whether to take a random or uniform selection of columns (or rows if rows=F) to run the subset PCA.

maximum size of the matrix in gigabytes for the subset PCA; 0.1 GB is the default, which should result in minimal processing time on a typical system. Increasing this increases the processing time, but also the representativeness of the subset chosen. Note that some very large matrices cannot be processed by this function unless this parameter is increased; basically if the dimension being thinned is more than 5 this memory limit (see estimate.memory() from NCmisc).


further parameters to pass to big.PCA() which performs the subset PCA used to determine the most representative rows (or columns).
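To get a feel for the memory limit argument described above, the arithmetic can be sketched as follows. This is a back-of-envelope estimate, not the calculation used by the package or by estimate.memory(): it assumes 8 bytes per double-precision value and 2^30 bytes per gigabyte, and the helper name matrix_gb is hypothetical.

```r
# Rough size of a dense numeric matrix in gigabytes (assumes 8 bytes per value).
matrix_gb <- function(n_row, n_col, bytes_per_value = 8) {
  n_row * n_col * bytes_per_value / 2^30
}

n_row <- 2000
limit_gb <- 0.1

# Size of a 2000 x 200 matrix, as in the example data
size_small <- matrix_gb(n_row, 200)

# Largest number of columns of a 2000-row matrix that fits within the limit
max_cols <- floor(limit_gb * 2^30 / (8 * n_row))
```

Under these assumptions a 2000 x 200 matrix is far below the 0.1 GB default, so the limit only starts to restrict the subset PCA for much larger inputs.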


A set of row or column indexes (depending on the 'rows' parameter) of the most representative variables in the matrix, defined as those most correlated with the principal components.
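Since the return value is a plain vector of indexes, it can be used directly to subset the original matrix. A minimal sketch, where the index vectors are made-up stand-ins for the function's output rather than real results:

```r
set.seed(42)
mat <- matrix(rnorm(20 * 6), nrow = 20)

# Hypothetical index vectors standing in for the returned value
row_idx <- c(2, 5, 11)   # as would be returned with rows = TRUE
col_idx <- c(1, 4)       # as would be returned with rows = FALSE

# drop = FALSE keeps the result a matrix even if a single index is selected
row_subset <- mat[row_idx, , drop = FALSE]
col_subset <- mat[, col_idx, drop = FALSE]
```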


Nicholas Cooper

See Also

thin,, big.PCA, get.big.matrix


mat <- matrix(rnorm(200*2000),ncol=200) # normal matrix
bmat <- as.big.matrix(mat)              # big matrix
ii <-,.05,rows=TRUE) # thin down to 5% of the rows
ii <-,45,rows=FALSE) # thin down to 45 columns
# show that rows=T is equivalent to rows=F of the transpose (random must be FALSE)
ii1 <-,.4,rows=TRUE,random=FALSE)
ii2 <-,.4,rows=FALSE,random=FALSE)

Example output

Loading required package: reader
Loading required package: NCmisc

Attaching package: 'reader'

The following objects are masked from 'package:NCmisc':

    cat.path, get.ext, rmv.ext

Loading required package: bigmemory
Loading required package: biganalytics
Loading required package: foreach
Loading required package: biglm
Loading required package: DBI
Warning messages:
1: replacing previous import 'reader::cat.path' by 'NCmisc::cat.path' when loading 'bigpca' 
2: replacing previous import 'reader::get.ext' by 'NCmisc::get.ext' when loading 'bigpca' 
3: replacing previous import 'reader::rmv.ext' by 'NCmisc::rmv.ext' when loading 'bigpca' 
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
[1] TRUE

bigpca documentation built on Nov. 22, 2017, 1:02 a.m.