thin: Reduce one dimension of a large matrix in a strategic way

View source: R/bigpca.R

Description

Thin the rows (or columns) of a large matrix or big.matrix to reduce the size of the dataset while retaining important information. The size of the subset can be given either as a proportion of the original size or as a new number of rows/columns, and there are four methods for choosing the subset. The simplest is uniform (evenly spaced) or random selection. The other methods look at the correlation structure of a subset of the data to derive non-arbitrary selections, using correlation, PCA, or association with a phenotype or some other categorical variable. Each of the four methods has its own function in this package; see those functions for more information. This function is merely a wrapper that selects one of the four.

Usage

thin(bigMat, keep = 0.05, how = c("uniform", "correlation", "pca",
  "association"), dir = "", rows = TRUE, random = TRUE, hi.cor = TRUE,
  least = TRUE, pref = "thin", verbose = FALSE, ret.obj = TRUE, ...)

Arguments

bigMat

a big.matrix object, or any argument accepted by get.big.matrix(), which includes paths to description files or even a standard matrix object.

keep

numeric, by default a proportion (decimal) of the original number of rows/columns to choose for the subset. Alternatively, an integer > 2 is taken to be the size of the desired subset. E.g., for a dataset with 10,000 rows where you want a subset of 1,000 rows, you could set 'keep' to either 0.1 or 1000.
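
The two ways of specifying 'keep' can be sketched in base R as follows. This only illustrates the rule described above: resolve.keep is a hypothetical helper, not a function in this package, and the rounding, the clamping, and the treatment of values between 1 and 2 are assumptions.

```r
# Hypothetical helper (not part of bigpca): turn 'keep' into a subset size.
resolve.keep <- function(keep, n) {
  stopifnot(is.numeric(keep), length(keep) == 1, keep > 0)
  if (keep > 2) {
    size <- round(keep)        # treated as an absolute subset size
  } else {
    size <- round(keep * n)    # treated as a proportion of n (assumed rounding)
  }
  max(1, min(n, size))         # assumption: clamp the result to 1..n
}

resolve.keep(0.1, 10000)   # 1000
resolve.keep(1000, 10000)  # 1000
```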

how

character, the method used to perform subset selection. Only the first two characters are required, and they are not case sensitive. The options are: 'uniform', evenly spaced selection when random=FALSE, or random selection otherwise, see uniform.select(); 'correlation', most correlated subset when hi.cor=TRUE, least correlated otherwise, see subcor.select(); 'pca', most representative variables of the principal components of a subset, see subpc.select(); 'association', subset most correlated with the phenotype if least=FALSE, or least correlated otherwise, see select.least.assoc().
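
As an illustration of the matching rule for 'how' (first two characters, case-insensitive), a minimal base R sketch might look like this. resolve.how is a hypothetical helper written for this example, not the package's own code, and falling back to the first option when 'how' is left at its default is an assumption based on the usual R idiom.

```r
# Hypothetical helper (not part of bigpca): match 'how' on its first two
# characters, ignoring case.
resolve.how <- function(how = c("uniform", "correlation", "pca", "association")) {
  methods <- c("uniform", "correlation", "pca", "association")
  key <- tolower(substr(how[1], 1, 2))          # e.g. "PCA" becomes "pc"
  hit <- methods[substr(methods, 1, 2) == key]  # compare with each option
  if (length(hit) != 1) stop("unrecognised method: ", how[1])
  hit
}

resolve.how("PCA")   # "pca"
resolve.how("cor")   # "correlation"
resolve.how()        # "uniform" (assumed default: the first option)
```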

dir

directory containing the filebacked.big.matrix, same as 'dir' for get.big.matrix.

rows

logical, whether to choose a subset of rows (TRUE), or columns (FALSE). rows is always TRUE when using 'association' methods.

random

logical, whether to use random selections and subsets (TRUE), or whether to use uniform selections that should give the same result each time for the same dataset (FALSE)
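
The difference between the two settings of 'random' can be illustrated with plain index vectors in base R. This is a sketch of the idea only; it does not reproduce what uniform.select() actually does, and the variable names are made up for the example.

```r
n <- 100   # number of rows in the original matrix
k <- 5     # desired subset size

# random = FALSE: evenly spaced indices, identical on every run
even.idx <- unique(round(seq(1, n, length.out = k)))

# random = TRUE: a random sample of indices, which differs between runs
# unless the seed is fixed
set.seed(42)
rand.idx <- sort(sample(n, k))
```

Either index vector could then be used to subset a standard matrix, e.g. mat[even.idx, ] for rows or mat[, even.idx] for columns.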

hi.cor

logical, if using 'correlation' methods, then whether to choose the most correlated (TRUE) or least correlated (FALSE).
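
To make the 'most correlated' versus 'least correlated' choice concrete, the following base R sketch ranks the rows of a small ordinary matrix by their mean absolute correlation with all other rows. The scoring rule is an assumption chosen for illustration; subcor.select() works on a subset of a big.matrix and may rank variables differently.

```r
set.seed(1)
mat <- matrix(rnorm(200), nrow = 20)   # 20 rows (variables) x 10 columns (samples)

cc <- abs(cor(t(mat)))                 # absolute row-by-row correlations
diag(cc) <- NA                         # ignore each row's correlation with itself
score <- rowMeans(cc, na.rm = TRUE)    # mean absolute correlation per row

k <- 5
most.cor  <- order(score, decreasing = TRUE)[1:k]   # analogous to hi.cor = TRUE
least.cor <- order(score)[1:k]                      # analogous to hi.cor = FALSE
```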

least

logical, if using 'association' methods, whether to choose the variables least associated (TRUE) or most associated (FALSE) with the phenotype.

pref

character, a prefix for big.matrix backing files generated by this selection

verbose

logical, whether to display more information about processing

ret.obj

logical, whether to return the result as a big.matrix object (TRUE), or as a reference to the binary file containing the big.matrix.descriptor object [either can be read with get.big.matrix() or prv.big.matrix()]

...

other arguments to be passed to uniform.select, subpc.select, subcor.select, or select.least.assoc

Value

A smaller big.matrix with fewer rows and/or columns than the original matrix or, if ret.obj=FALSE, a reference to the file containing its descriptor.

Author(s)

Nicholas Cooper

See Also

uniform.select, subpc.select, subcor.select, select.least.assoc, big.select, get.big.matrix

Examples

orig.dir <- getwd()
setwd(tempdir()) # move to temporary dir
if(file.exists("thin.bck")) { unlink(c("thin.bck","thin.dsc")) }
bmat <- generate.test.matrix(5,big.matrix=TRUE)
prv.big.matrix(bmat)
# make 5% random selection:
lmat <- thin(bmat, pref="th2")
prv.big.matrix(lmat)
# make 10% most orthogonal selection (lowest correlations):
lmat <- thin(bmat,.10,"cor",hi.cor=FALSE, pref="th3")
prv.big.matrix(lmat)
# make 10% most representative selection:
lmat <- thin(bmat,.10,"PCA",ret.obj=FALSE, pref="th4") # return file name instead of object
print(lmat)
prv.big.matrix(lmat)
# make 25% selection most correlated to phenotype
# create random phenotype variable
pheno <- rep(1,ncol(bmat))
pheno[which(runif(ncol(bmat))<.5)] <- 2 # assign about half the samples to group 2
lmat <- thin(bmat,.25,"assoc",phenotype=pheno,least=FALSE,verbose=TRUE, pref="th5")
prv.big.matrix(lmat)
# tidy up temporary files:
rm(lmat) 
unlink(c("thin.bck","thin.dsc","thin.RData",paste0("th",2:5)))
setwd(orig.dir)

bigpca documentation built on Nov. 22, 2017, 1:02 a.m.