#' Perform unsupervised clustering to capture population structure based on iterative pruning
#'
#' @description This version supports ordinal data which can be applied directly
#' to SNP data to identify fine-scale population structure. It was built on the
#' iterative pruning Principal Component Analysis (ipPCA) algorithm
#' (Intarapanich et al., 2009; Limpiti et al., 2011). The IPCAPS involves an
#' iterative process using multiple splits based on multivariate Gaussian
#' mixture modeling of principal components and Clustering EM estimation (Lebret
#' et al., 2015). In each iteration, rough clusters and outliers are also
#' identified using our own method called rubikclust (R package \pkg{KRIS}).
#'
#' @param bed A PLINK binary format consists of 3 files; bed, bim, and fam. To
#' generate these files from PLINK, use option –make-bed. See more details at:
#' \href{http://zzz.bwh.harvard.edu/plink/data.shtml}{harvard.edu}.
#' @param rdata In case of re-analysis, it is convenient to run IPCAPS using the
#' file rawdata.RData generated by IPCAPS. This file contains a matrix of SNPs
#' (raw.data) and a vector of labels (label).
#' @param files IPCAPS supports SNPs encoded as 0, 1 and 2 (dosage encoding).
#' Rows represent SNPs and columns represent subjects. Each column needs to be
#' separated by a space or a tab. A big text file should be divided into smaller
#' files to load faster. For instance, to input 3 files, use as: files=c(
#' 'input1.txt', 'input2.txt', 'input3.txt').
#' @param label.file An additional useful information (called "labels" in
#' IPCAPS) related subject, for example, geographic location or disease
#' phenotype. These labels (one at a time) are used in displaying the clustering
#' outcome of IPCAPS. A label file must contain at least one column. However,
#' it may contain more than one column in which case each column need to be
#' separated by a space or a tab.
#' @param lab.col The label in the label file to be used in the tree-like
#' display of IPCAPS clustering results.
#' @param out To set an absolute path for IPCAPS output. If the specified output
#' directory already exists, result files are saved in sub-directories
#' cluster_out, cluster_out1, cluster_out2, etc.
#' @param plot.as.pdf To export plots as PDF. When omitted, plots are saved as PNG.
#' @param method The internal clustering method. It can be set to "mix"
#' (rubikclust & mixmod), "mixmod" (Lebret et al., 2015), "clara" (R: Clustering
#' Large Applications), "pam" (R: Partitioning Around Medoids (PAM) Object),
#' "meanshift" (Wang, 2016), "apcluster" (Bodenhofer et al., 2016), and "hclust"
#' (R: Hierarchical Clustering). Default = "mix".
#' @param missing Symbol used for missing genotypes. Default = NA.
#' @param covariate A file of covariates; one covariate per column. SNPs can be
#' adjusted for these covariates via regression modeling and residual
#' computation.
#' @param cov.col.first Refer to a covariate file, the first covariate to be
#' considered as confounding variable.
#' @param cov.col.last Refer to a covariate file, the last covariate to be
#' considered as confounding variable. All the variables in between the
#' cov.col.first and cov.col.last will be considered in the adjustment process.
#' @param threshold Cutoff value for EigenFit. Possible values range from 0.03
#' to 0.18. The smaller, the potentially finer the substructure can be. Default
#' = 0.18.
#' @param min.fst Minimum Fst between a pair of subgroups. Default = 0.0008.
#' @param min.in.group Minimum number of individuals to constitute a cluster or
#' subgroup. Default = 20.
#' @param no.plot No plot is generated if this option is TRUE. This option is
#' useful when the system does not support X Windows in the unix based system.
#' Default = FALSE.
#'
#' @return Returns the list object containing 2 internal objects;
#' output.dir as class character and cluster as class data.frame. The object
#' output.dir stores a result directory. The object cluster contains 4 columns,
#' group, node, label, and row.number. The column group contains the assigned
#' groups from IPCAPS. The column node contains node numbers in a tree as
#' illustrated in the HTML result files. The column label contains the given
#' labels that is set to the parameter label. The column row.number contains
#' indices to an input data, which is matched to row numbers of input matrix.
#' All clustering result files are saved in one directory (as specified by out)
#' containing assigned groups, hierarchical trees of group membership, PC plots,
#' and binary data for further analysis.
#' \itemize{
#' \item \code{$snp} is a SNP matrix from BED file.
#' \item \code{$snp.info} is a data.frame of SNP information from BIM file.
#' \item \code{$ind.info} is a data.frame of individual information from FAM file.
#' }
#' If function return \code{NULL}, it means the input files are not in proper
#' format.
#'
#' @details The computational time depends on the number of individuals.
#' Consequentially, if large data sets are analyzed, it may be necessary first
#' to split data into smaller files, and then load into computer memory (with
#' parameter 'files'). Internally, the split files are merged to construct
#' a com-prehensive matrix.
#'
#' @export
#'
#' @include preprocess.R
#' @include postprocess.R
#' @include process.each.node.R
#'
#' @md
#'
#' @references
#'
#' Bodenhofer, U., Palme, J., Melkonian, C., and Kothmeier, A. (2016). apcluster
#' : Affinity Propagation Clustering. Available at \href{ https://CRAN.R-project.org/package=apcluster}{CRAN} (Accessed March 7, 2017).
#'
#' Intarapanich, A., Shaw, P. J., Assawamakin, A., Wangkumhang, P., Ngamphiw, C.
#' , Chaichoompu, K., et al. (2009). Iterative pruning PCA improves resolution
#' of highly structured populations. BMC Bioinformatics 10, 382. doi:10.1186/
#' 1471-2105-10-382.
#'
#' Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G., and
#' Govaert, G. (2015). Rmixmod: TheRPackage of the Model-Based Unsupervised,
#' Supervised, and Semi-Supervised ClassificationMixmodLibrary. J. Stat. Softw.
#' 67. doi:10.18637/jss.v067.i06.
#'
#' Limpiti, T., Intarapanich, A., Assawamakin, A., Shaw, P. J., Wangkumhang, P.,
#' Piriyapongsa, J., et al. (2011). Study of large and highly stratified
#' population datasets by combining iterative pruning principal component
#' analysis and structure. BMC Bioinformatics 12, 255. doi:10.1186/1471-2105-12-
#' 255.
#'
#' Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2017).
#' cluster: Cluster Analysis Basics and Extensions.
#'
#' R: Clustering Large Applications Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clara.html}{ethz.ch} (Accessed March 7, 2017).
#'
#' R Core Team (2017). R: A Language and Environment for Statistical Computing.
#' Vienna, Austria: R Foundation for Statistical Computing Available at:
#' \href{https://www.R-project.org/}{r-project.org}.
#'
#' R: Hierarchical Clustering Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html}{ethz.ch} (Accessed March 7, 2017).
#'
#' R: Partitioning Around Medoids (PAM) Object Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.object.html}{ethz.ch} (Accessed
#' March 7, 2017).
#'
#' Wang, M. C. and D. (2016). MeanShift: Clustering via the Mean Shift
#' Algorithm. Available at: \href{https://CRAN.R-project.org/package=MeanShift}{CRAN}
#' (Accessed March 7, 2017).
#'
#' @examples
#'
#' #Use the BED format as input
#' #Importantly, bed file, bim file, and fam file are required
#' #Use the example files embedded in the package
#'
#' BED.file <- system.file("extdata", "ipcaps_example.bed", package = "IPCAPS")
#' LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz",
#' package = "IPCAPS")
#' my.cluster1 <- ipcaps(bed = BED.file, label.file = LABEL.file, lab.col = 2,
#' out = tempdir())
#'
#' table(my.cluster1$cluster$label, my.cluster1$cluster$group)
#'
#' # Alternatively, use a text file as input
#' # Use the example files embedded in the package
#'
#' #text.file <- system.file("extdata", "ipcaps_example_rowVar_colInd.txt.gz",
#' # package="IPCAPS")
#' #LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz",
#' # package="IPCAPS")
#'
#' #my.cluster2 <- ipcaps(files = c(text.file), label.file = LABEL.file, lab.col = 2,
#' # out=tempdir())
#' #table(my.cluster2$cluster$label, my.cluster2$cluster$group)
#'
#' # The other alternative way, use an R Data file as input
#' # Use the example file embedded in the package
#'
#' #rdata.file <- system.file("extdata", "ipcaps_example.rda", package = "IPCAPS")
#'
#' #my.cluster3 <- ipcaps(rdata = rdata.file, out = tempdir())
#' #table(my.cluster3$cluster$label, my.cluster3$cluster$group)
#'
ipcaps <- function( bed = NA, rdata = NA, files = NA, label.file = NA,
lab.col = 1, out, plot.as.pdf = FALSE, method = 'mix',
missing = NA, covariate = NA, cov.col.first = NA,
cov.col.last = NA, threshold = 0.18, min.fst = 0.0008,
min.in.group = 20, no.plot = FALSE){
filename.label = label.file
label.column = lab.col
result.dir = out
rerun = FALSE
rdata.infile = rdata
bed.infile = bed
datatype = 'snp'
nonlinear = FALSE
missing.char = missing
regression.file = covariate
regression.col.first = cov.col.first
regression.col.last = cov.col.last
file.list = files
max.thread = 1
cate.list = NA
start.time <- Sys.time()
if (length(threshold)<=0){
threshold=0.18 #work in general. The lowest value is 0.03 for the data without outlier
}
if (length(min.fst)<=0){
min.fst=0.0008
}
usage= paste0("Usage: ?ipcaps\n")
if (length(result.dir)==0){
cat(usage)
quit()
}
if ((length(filename.label)==0 || is.na(file.list)) && (length(rdata.infile)==0) &&
(length(bed.infile)==0) && is.na(cate.list)){
cat(usage)
quit()
}
cat(paste0("Running ... IPCAPS \n\toutput: ",result.dir," \n"))
if (length(filename.label)>0){
cat(paste0("\tlabel file: ",filename.label,"\n"))
}
if (length(label.column)>0){
if (is.na(as.integer(label.column))){
#except only comma as separator, otherwise
if (length(strsplit(label.column,',')[[1]])){
if (anyNA(as.integer(strsplit(label.column,',')[[1]]))){
label.column = 1
cat(paste0("\tlabel column: ",label.column,"\n"))
}else{
label.column = as.integer(strsplit(label.column,',')[[1]])
}
}else{
label.column = 1
cat(paste0("\tlabel column: ",label.column,"\n"))
}
}else{
label.column = as.integer(label.column)
cat(paste0("\tlabel column: ",label.column,"\n"))
}
}else{
label.column = 1
}
if (length(threshold)>0){
cat(paste0("\tthreshold: ",threshold,"\n"))
}
if (length(min.fst)>0){
cat(paste0("\tminimum Fst: ",min.fst,"\n"))
}
if (length(max.thread)<=0){
max.thread=NA
}
if (is.na(max.thread)){
max.thread = 1
}else{
if (max.thread < 1){
max.thread = 1
}
}
cat(paste0("\tmaximum thread: ",max.thread,"\n"))
if (length(method)>0){
cat(paste0("\tmethod: ",method,"\n"))
}else{
method="mix"
}
if (!is.na(rdata.infile)){
if (file.exists(rdata.infile)){
cat(paste0("\trdata: ",rdata.infile,"\n"))
}else{
rdata.infile = NA
}
}else{
rdata.infile = NA
}
if (!is.na(bed.infile)){
if (file.exists(bed.infile)){
cat(paste0("\tbed: ",bed.infile,"\n"))
}else{
bed.infile = NA
}
}else{
bed.infile = NA
}
if (!is.na(file.list)){
cat(paste0("\tfiles: \n"))
print(file.list)
}
if (plot.as.pdf){
cat(paste0("\tplot.as.pdf: TRUE\n"))
plot.as.pdf=TRUE
}else{
plot.as.pdf=FALSE
}
if (length(min.in.group)>0){
if (min.in.group < 5){
min.in.group = 5
}
cat(paste0("\tminimum in group: ",min.in.group,"\n"))
}else{
min.in.group=20
}
if (length(datatype)>0){
cat(paste0("\tdata type: ",datatype,"\n"))
}else{
datatype="snp"
}
if (length(missing.char)>0){
cat(paste0("\tmissing: ",missing.char,"\n"))
}else{
missing.char = NA
}
if (length(regression.file)>0 && !is.na(regression.file)){
cat(paste0("\tcovariate file: ",regression.file,"\n"))
}else{
regression.file = NA
}
if (!is.na(regression.col.first) && regression.col.first>0){
cat(paste0("\tcovariate first column: ",regression.col.first,"\n"))
}else{
regression.col.first = NA
}
if (!is.na(regression.col.last) && regression.col.last>0){
cat(paste0("\tcovariate last column: ",regression.col.last,"\n"))
}else{
regression.col.last = NA
}
#preprocessing step
cat(paste0("In preprocessing step\n"))
result.dir=preprocess(files=file.list, label.file=filename.label, lab.col=label.column,
rdata.infile=rdata.infile, bed.infile=bed.infile, cate.list=cate.list,
result.dir=result.dir, threshold=threshold, min.fst=min.fst,
reanalysis=rerun, method=method, min.in.group=min.in.group,datatype=datatype,
nonlinear=nonlinear, missing.char=missing.char,
regression.file=regression.file, regression.col.first=regression.col.first,
regression.col.last=regression.col.last,plot.as.pdf=plot.as.pdf,no.plot=no.plot)
if (is.null(result.dir)){
return(NULL)
}
#job scheduler
cat(paste0("Start calculating\n"))
#create the first task for Node 1
#status 0 = unprocessed, 1 = processing, 2 = done , -1 = deleted
node = c(1)
parent.node = c(0)
status = c(0)
tree = data.frame(node,parent.node,status)
file.name = file.path(result.dir,"RData","tree.RData")
save(tree,file=file.name, compress = 'bzip2')
while (TRUE){
file.name = file.path(result.dir,"RData","tree.RData")
load(file.name)
#check for terminate loop
row2 = which(tree$status==2)
row_1 = which(tree$status==-1)
if ((length(row2)+length(row_1))==length(tree$node)){
break
}
file.name = file.path(result.dir,"RData","condition.RData")
load(file=file.name)
#take one node to process
which.row = which(tree$status==0)
if (length(which.row)>0){
exe.node.list = tree$node[which.row]
for(idx in exe.node.list) {
process.each.node(node=idx,work.dir=result.dir)
}
}else{
break
}
}
cat(paste0("In post process step\n"))
cluster.tab = postprocess(result.dir=result.dir)
end.time <- Sys.time()
run.time = as.numeric(end.time-start.time, units = "secs")
cat(paste0("Total runtime is ",run.time," sec\n"))
file.name = file.path(result.dir,"RData","runtime.RData")
save(run.time,file=file.name, compress = 'bzip2')
cluster.obj <- list("output.dir"=result.dir,"cluster"=cluster.tab)
file.name = file.path(result.dir,"RData","result.RData")
save(cluster.obj,file=file.name, compress = 'bzip2')
return(cluster.obj)
}
# Check the result files in your output directory
# groups.txt contains the assigned groups of samples
# tree_text.html is the text result shown as binary tree
# tree_scatter_cluster.html is the scatter plots colored by clustering result
# tree_scatter_label.html is the scatter plots colored by labels
# tree_scree.html is the scree plots of eigen values
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.