uniform.select: Derive a subset of a large dataset


View source: R/bigpca.R

Description

Either randomly or uniformly select rows or columns from a large dataset to form a new smaller dataset.

Usage

uniform.select(bigMat, keep = 0.05, rows = TRUE, dir = "",
  random = TRUE, ram.gb = 0.1)

Arguments

bigMat

a big.matrix object, or any argument accepted by get.big.matrix(), which includes paths to description files or even a standard matrix object.

keep

numeric, by default a proportion (decimal) of the original number of rows/columns to choose for the subset. Otherwise, if an integer > 2, it is taken as the size of the desired subset; e.g., for a dataset with 10,000 rows where you want a subset of 1,000 rows, you could set 'keep' to either 0.1 or 1000.

rows

logical, whether the subset should be of the rows of bigMat. If rows=FALSE, the subset is chosen from columns; this is equivalent to calling uniform.select(t(bigMat)), but avoids actually performing the transpose, which can save time for large matrices.

dir

directory containing the filebacked.big.matrix, same as dir for get.big.matrix.

random

logical, whether to take a random or uniform selection of rows (or columns if rows=FALSE).

ram.gb

maximum size of the matrix in gigabytes for the subset PCA. The default of 0.1 GB should result in minimal processing time on a typical system. Increasing this increases the processing time, but also the representativeness of the subset chosen. Note that some very large matrices cannot be processed by this function unless this parameter is increased; basically if the dimension being thinned is more than 5 this memory limit (see estimate.memory() from NCmisc).
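
The way 'keep', 'rows' and 'random' interact can be illustrated with a minimal base-R sketch on an ordinary matrix. This is not the bigpca implementation: the helper name sketch.select() is hypothetical, values of 'keep' below 1 are assumed to be proportions and larger values subset sizes, and "uniform" is assumed to mean evenly spaced indexes. The 'dir' and 'ram.gb' arguments are ignored here.

```r
# Hypothetical helper, NOT the bigpca source: illustrates the documented
# meaning of 'keep', 'rows' and 'random' on a standard matrix.
sketch.select <- function(mat, keep = 0.05, rows = TRUE, random = TRUE) {
  n <- if (rows) nrow(mat) else ncol(mat)    # size of the dimension being thinned
  # Assumption: keep < 1 is a proportion, otherwise it is a subset size
  size <- if (keep < 1) round(keep * n) else round(keep)
  size <- max(1, min(n, size))               # clamp to a valid subset size
  if (random) {
    sort(sample(n, size))                    # random selection without replacement
  } else {
    unique(round(seq(1, n, length.out = size)))  # evenly spaced selection
  }
}

mat <- matrix(rnorm(200 * 100), ncol = 200)  # 100 rows, 200 columns
r1 <- sketch.select(mat, 0.05, rows = TRUE, random = FALSE)  # 5 evenly spaced rows
c1 <- sketch.select(mat, 45, rows = FALSE, random = TRUE)    # 45 random columns
```

In this sketch the evenly spaced branch gives the same indexes on every call, whereas the random branch is only repeatable if set.seed() is called first.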

Value

A set of row or column indexes (depending on the 'rows' parameter) of uniformly distributed (optionally reproducible) or randomly selected variables in the matrix.
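
Because the value is a vector of indexes rather than a new matrix, the subset itself is obtained by ordinary indexing. A minimal sketch with a standard matrix follows; 'ii.rows' and 'ii.cols' are hand-written stand-ins for values that uniform.select() might return, not actual output of the function.

```r
mat <- matrix(seq_len(100 * 200), nrow = 100)  # 100 rows, 200 columns
ii.rows <- c(1, 26, 50, 75, 100)               # stand-in for a rows=TRUE result
ii.cols <- seq(1, 200, by = 5)                 # stand-in for a rows=FALSE result

sub.rows <- mat[ii.rows, , drop = FALSE]       # selected rows, all columns
sub.cols <- mat[, ii.cols, drop = FALSE]       # all rows, selected columns

# Check: indexing columns directly gives the same result as
# indexing the rows of the transpose and transposing back
same <- identical(sub.cols, t(t(mat)[ii.cols, , drop = FALSE]))
```

Using drop = FALSE keeps the result a matrix even when only one row or column is selected.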

Author(s)

Nicholas Cooper

See Also

subpc.select

Examples

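library(bigpca)                            # provides uniform.select()
mat <- matrix(rnorm(200*100), ncol=200)    # standard matrix: 100 rows x 200 columns
bmat <- as.big.matrix(mat)                 # convert to a big.matrix (bigmemory)
ii1 <- uniform.select(bmat, .05, rows=TRUE)   # thin down to 5% of the rows
ii2 <- uniform.select(bmat, 45, rows=FALSE, random=TRUE)  # thin down to 45 columns
prv(ii1, ii2)                              # preview both index vectors (prv() is from NCmisc)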

bigpca documentation built on Nov. 22, 2017, 1:02 a.m.