thin: Reduce one dimension of a large matrix in a strategic way
In bigpca: PCA, Transpose and Multicore Functionality for 'big.matrix' Objects


Thin the rows (or columns) of a large matrix or big.matrix to reduce the size of the dataset while retaining important information. The subset size can be given either as a proportion of the original size or as a new number of rows/columns, and there are four methods for choosing the subset. Simple uniform or random selection can be specified. The other methods look at the correlation structure of a subset of the data to derive non-arbitrary selections, using correlation, PCA, or association with a phenotype or some other categorical variable. Each of the four methods has its own function in this package, documented separately; this function is merely a wrapper that selects one of the four.
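
To make the two simple selection modes concrete, here is a minimal base-R sketch on an ordinary matrix. It is not the package's implementation: `resolve_keep` and `select_rows_sketch` are hypothetical helpers, and treating any `keep` value above 2 as a count (and anything else as a proportion) is an assumption based on the argument description.

```r
# Hypothetical helper, not part of bigpca: turn 'keep' into a subset size.
# Assumption: values above 2 are counts, anything else is a proportion.
resolve_keep <- function(keep, n) {
  stopifnot(is.numeric(keep), length(keep) == 1, keep > 0)
  size <- if (keep > 2) round(keep) else round(keep * n)
  max(1, min(n, size))  # clamp to the range 1..n
}

# Hypothetical helper: choose row indices at random or evenly spaced.
select_rows_sketch <- function(n, keep = 0.05, random = TRUE) {
  size <- resolve_keep(keep, n)
  if (random) {
    sort(sample.int(n, size))                     # random subset, in row order
  } else {
    unique(round(seq(1, n, length.out = size)))   # evenly spaced, reproducible
  }
}

set.seed(1)
mat <- matrix(rnorm(1000 * 20), nrow = 1000)

idx_random  <- select_rows_sketch(nrow(mat), keep = 0.1, random = TRUE)
idx_uniform <- select_rows_sketch(nrow(mat), keep = 100, random = FALSE)

# Thin the rows of the ordinary matrix using the evenly spaced indices.
small <- mat[idx_uniform, , drop = FALSE]
dim(small)
```

The evenly spaced variant gives the same indices on every call for the same dimensions, whereas the random variant depends on the seed.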

thin(bigMat, keep = 0.05, how = c("uniform", "correlation", "pca",
  "association"), dir = "", rows = TRUE, random = TRUE, hi.cor = TRUE,
  least = TRUE, pref = "thin", verbose = FALSE, ret.obj = TRUE, ...)

`bigMat`	a big.matrix object, or any argument accepted by get.big.matrix(), which includes paths to description files or even a standard matrix object.
`keep`	numeric, by default a proportion (decimal) of the original number of rows/columns to choose for the subset. If an integer greater than 2 is given instead, it is taken as the desired subset size; e.g., for a dataset with 10,000 rows where you want a subset of 1,000 rows, you could set 'keep' to either 0.1 or 1000.
`how`	character, the method used to perform subset selection; only the first two characters are required and they are not case sensitive. Options are: 'uniform': evenly spaced selection when random=FALSE, or random selection otherwise; see uniform.select(). 'correlation': most correlated subset when hi.cor=TRUE, least correlated otherwise; see subcor.select(). 'pca': most representative variables of the principal components of a subset; see subpc.select(). 'association': subset most correlated with phenotype if least=FALSE, or least correlated otherwise; see select.least.assoc().
`dir`	directory containing the filebacked.big.matrix, same as 'dir' for get.big.matrix.
`rows`	logical, whether to choose a subset of rows (TRUE), or columns (FALSE). rows is always TRUE when using 'association' methods.
`random`	logical, whether to use random selections and subsets (TRUE), or whether to use uniform selections that should give the same result each time for the same dataset (FALSE)
`hi.cor`	logical, if using 'correlation' methods, then whether to choose the most correlated (TRUE) or least correlated (FALSE).
`least`	logical, if using 'association' methods, whether to choose the variables least associated (TRUE) or most associated (FALSE) with phenotype
`pref`	character, a prefix for big.matrix backing files generated by this selection
`verbose`	logical, whether to display more information about processing
`ret.obj`	logical, whether to return the result as a big.matrix object (TRUE), or as a reference to the binary file containing the big.matrix.descriptor object [either can be read with get.big.matrix() or prv.big.matrix()]
`...`	other arguments to be passed to uniform.select, subpc.select, subcor.select, or select.least.assoc
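
The `how` argument is described as needing only its first two characters, case-insensitively. A minimal sketch of that kind of matching follows; `match_how_sketch` is a hypothetical helper written for illustration, and the package's own matching logic may differ.

```r
# Hypothetical helper, not the package's code: resolve an abbreviated,
# case-insensitive 'how' value by comparing its first two characters.
match_how_sketch <- function(how = c("uniform", "correlation", "pca",
                                     "association")) {
  methods <- c("uniform", "correlation", "pca", "association")
  key <- tolower(substr(how[1], 1, 2))          # e.g. "PCA" -> "pc"
  hit <- methods[substr(methods, 1, 2) == key]  # prefixes un/co/pc/as are distinct
  if (length(hit) != 1) stop("unrecognised 'how': ", how[1])
  hit
}

match_how_sketch("cor")
match_how_sketch("PCA")
match_how_sketch("assoc")
match_how_sketch()        # falls back to the first option
```

Because the four method names differ in their first two characters, a two-character prefix is enough to identify each one unambiguously.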

A smaller big.matrix with fewer rows or columns than the original matrix or, when ret.obj=FALSE, a reference to the file containing its big.matrix.descriptor object.

Nicholas Cooper

uniform.select, subpc.select, subcor.select, select.least.assoc, big.select, get.big.matrix

orig.dir <- getwd(); setwd(tempdir()); # move to temporary dir
if(file.exists("thin.bck")) { unlink(c("thin.bck","thin.dsc")) }
bmat <- generate.test.matrix(5,big.matrix=TRUE)
prv.big.matrix(bmat)
# make 5% random selection:
lmat <- thin(bmat, pref="th2")
prv.big.matrix(lmat)
# make 10% most orthogonal selection (lowest correlations):
lmat <- thin(bmat,.10,"cor",hi.cor=FALSE, pref="th3")
prv.big.matrix(lmat)
# make 10% most representative selection:
lmat <- thin(bmat,.10,"PCA",ret.obj=FALSE, pref="th4") # return file name instead of object
print(lmat)
prv.big.matrix(lmat)
# make 25% selection most correlated to phenotype
# create random phenotype variable
pheno <- rep(1,ncol(bmat)); pheno[which(runif(ncol(bmat))<.5)] <- 2
lmat <- thin(bmat,.25,"assoc",phenotype=pheno,least=FALSE,verbose=TRUE, pref="th5")
prv.big.matrix(lmat)
# tidy up temporary files:
rm(lmat) 
unlink(c("thin.bck","thin.dsc","thin.RData",paste0("th",2:5)))
setwd(orig.dir)