ipcaps: Iterative Pruning to Capture Population Structure

Documented in ipcaps

#' Perform unsupervised clustering to capture population structure based on iterative pruning
#'
#' @description This version supports ordinal data which can be applied directly
#' to SNP data to identify fine-scale population structure. It was built on the
#' iterative pruning Principal Component Analysis (ipPCA) algorithm
#' (Intarapanich et al., 2009; Limpiti et al., 2011). The IPCAPS involves an
#' iterative process using multiple splits based on multivariate Gaussian
#' mixture modeling of principal components and Clustering EM estimation (Lebret
#' et al., 2015). In each iteration, rough clusters and outliers are also
#' identified using our own method called rubikclust (R package \pkg{KRIS}).
#'
#' @param bed A PLINK binary format consists of 3 files; bed, bim, and fam. To
#' generate these files from PLINK, use option –make-bed. See more details at:
#' \href{http://zzz.bwh.harvard.edu/plink/data.shtml}{harvard.edu}.
#' @param rdata In case of re-analysis, it is convenient to run IPCAPS using the
#' file rawdata.RData generated by IPCAPS. This file contains a matrix of SNPs
#' (raw.data) and a vector of labels (label).
#' @param files IPCAPS supports SNPs encoded as 0, 1 and 2 (dosage encoding).
#' Rows represent SNPs and columns represent subjects. Each column needs to be
#' separated by a space or a tab. A big text file should be divided into smaller
#' files to load faster. For instance, to input 3 files, use as: files=c(
#' 'input1.txt', 'input2.txt', 'input3.txt').
#' @param label.file An additional useful information (called "labels" in
#' IPCAPS) related subject, for example, geographic location or disease
#' phenotype. These labels (one at a time) are used in displaying the clustering
#' outcome of IPCAPS. A label file must contain at least one column. However,
#' it may contain more than one column in which case each column need to be
#' separated by a space or a tab.
#' @param lab.col The label in the label file to be used in the tree-like
#' display of IPCAPS clustering results.
#' @param out To set an absolute path for IPCAPS output. If the specified output
#' directory already exists, result files are saved in sub-directories
#' cluster_out, cluster_out1, cluster_out2, etc.
#' @param plot.as.pdf To export plots as PDF. When omitted, plots are saved as PNG.
#' @param method	The internal clustering method. It can be set to "mix"
#' (rubikclust & mixmod), "mixmod" (Lebret et al., 2015), "clara" (R: Clustering
#' Large Applications), "pam" (R: Partitioning Around Medoids (PAM) Object),
#' "meanshift" (Wang, 2016), "apcluster" (Bodenhofer et al., 2016), and "hclust"
#' (R: Hierarchical Clustering). Default = "mix".
#' @param missing Symbol used for missing genotypes. Default = NA.
#' @param covariate A file of covariates; one covariate per column. SNPs can be
#' adjusted for these covariates via regression modeling and residual
#' computation.
#' @param cov.col.first Refer to a covariate file, the first covariate to be
#' considered as confounding variable.
#' @param cov.col.last Refer to a covariate file, the last covariate to be
#' considered as confounding variable. All the variables in between the
#' cov.col.first and cov.col.last will be considered in the adjustment process.
#' @param threshold Cutoff value for EigenFit. Possible values range from 0.03
#' to 0.18. The smaller, the potentially finer the substructure can be. Default
#' = 0.18.
#' @param min.fst Minimum Fst between a pair of subgroups. Default = 0.0008.
#' @param min.in.group Minimum number of individuals to constitute a cluster or
#' subgroup. Default = 20.
#' @param no.plot No plot is generated if this option is TRUE. This option is
#' useful when the system does not support X Windows in the unix based system.
#' Default = FALSE.
#'
#' @return Returns the list object containing 2 internal objects;
#' output.dir as class character and cluster as class data.frame. The object
#' output.dir stores a result directory. The object cluster contains 4 columns,
#' group, node, label, and row.number. The column group contains the assigned
#' groups from IPCAPS. The column node contains node numbers in a tree as
#' illustrated in the HTML result files. The column label contains the given
#' labels that is set to the parameter label. The column row.number contains
#' indices to an input data, which is matched to row numbers of input matrix.
#' All clustering result files are saved in one directory (as specified by out)
#' containing assigned groups, hierarchical trees of group membership, PC plots,
#' and binary data for further analysis.
#' \itemize{
#' \item \code{$snp} is a SNP matrix from BED file.
#' \item \code{$snp.info} is a data.frame of SNP information from BIM file.
#' \item \code{$ind.info} is a data.frame of individual information from FAM file.
#' }
#' If function return \code{NULL}, it means the input files are not in proper
#' format.
#'
#' @details The computational time depends on the number of individuals.
#' Consequentially, if large data sets are analyzed, it may be necessary first
#' to split data into smaller files, and then load into computer memory (with
#'  parameter 'files'). Internally, the split files are merged to construct
#'  a com-prehensive matrix.
#'
#' @export
#'
#' @include preprocess.R
#' @include postprocess.R
#' @include process.each.node.R
#'
#' @md
#'
#' @references
#'
#' Bodenhofer, U., Palme, J., Melkonian, C., and Kothmeier, A. (2016). apcluster
#' : Affinity Propagation Clustering. Available at \href{ https://CRAN.R-project.org/package=apcluster}{CRAN} (Accessed March 7, 2017).
#'
#' Intarapanich, A., Shaw, P. J., Assawamakin, A., Wangkumhang, P., Ngamphiw, C.
#' , Chaichoompu, K., et al. (2009). Iterative pruning PCA improves resolution
#' of highly structured populations. BMC Bioinformatics 10, 382. doi:10.1186/
#' 1471-2105-10-382.
#'
#' Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G., and
#' Govaert, G. (2015). Rmixmod: TheRPackage of the Model-Based Unsupervised,
#' Supervised, and Semi-Supervised ClassificationMixmodLibrary. J. Stat. Softw.
#' 67. doi:10.18637/jss.v067.i06.
#'
#' Limpiti, T., Intarapanich, A., Assawamakin, A., Shaw, P. J., Wangkumhang, P.,
#' Piriyapongsa, J., et al. (2011). Study of large and highly stratified
#' population datasets by combining iterative pruning principal component
#' analysis and structure. BMC Bioinformatics 12, 255. doi:10.1186/1471-2105-12-
#' 255.
#'
#' Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2017).
#' cluster: Cluster Analysis Basics and Extensions.
#'
#' R: Clustering Large Applications Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clara.html}{ethz.ch} (Accessed March 7, 2017).
#'
#' R Core Team (2017). R: A Language and Environment for Statistical Computing.
#' Vienna, Austria: R Foundation for Statistical Computing Available at:
#' \href{https://www.R-project.org/}{r-project.org}.
#'
#' R: Hierarchical Clustering Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html}{ethz.ch} (Accessed March 7, 2017).
#'
#' R: Partitioning Around Medoids (PAM) Object Available at: \href{https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.object.html}{ethz.ch} (Accessed
#' March 7, 2017).
#'
#' Wang, M. C. and D. (2016). MeanShift: Clustering via the Mean Shift
#' Algorithm. Available at: \href{https://CRAN.R-project.org/package=MeanShift}{CRAN}
#' (Accessed March 7, 2017).
#'
#' @examples
#'
#' #Use the BED format as input
#' #Importantly, bed file, bim file, and fam file are required
#' #Use the example files embedded in the package
#'
#' BED.file <- system.file("extdata", "ipcaps_example.bed", package = "IPCAPS")
#' LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz",
#'                           package = "IPCAPS")
#' my.cluster1 <- ipcaps(bed = BED.file, label.file = LABEL.file, lab.col = 2,
#' out = tempdir())
#'
#' table(my.cluster1$cluster$label, my.cluster1$cluster$group)
#'
#' # Alternatively, use a text file as input
#' # Use the example files embedded in the package
#'
#' #text.file <- system.file("extdata", "ipcaps_example_rowVar_colInd.txt.gz",
#' #                          package="IPCAPS")
#' #LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz",
#' #                           package="IPCAPS")
#'
#' #my.cluster2 <- ipcaps(files = c(text.file), label.file = LABEL.file, lab.col = 2,
#' #                       out=tempdir())
#' #table(my.cluster2$cluster$label, my.cluster2$cluster$group)
#'
#' # The other alternative way, use an R Data file as input
#' # Use the example file embedded in the package
#'
#' #rdata.file <- system.file("extdata", "ipcaps_example.rda", package = "IPCAPS")
#'
#' #my.cluster3 <- ipcaps(rdata = rdata.file, out = tempdir())
#' #table(my.cluster3$cluster$label, my.cluster3$cluster$group)
#'
ipcaps <- function( bed = NA, rdata = NA, files = NA, label.file = NA,
                    lab.col = 1, out, plot.as.pdf = FALSE, method = 'mix',
                    missing = NA, covariate = NA, cov.col.first = NA,
                    cov.col.last = NA, threshold = 0.18, min.fst = 0.0008,
                    min.in.group = 20, no.plot = FALSE){

  filename.label = label.file
  label.column = lab.col
  result.dir = out
  rerun = FALSE
  rdata.infile = rdata
  bed.infile = bed
  datatype = 'snp'
  nonlinear = FALSE
  missing.char = missing
  regression.file = covariate
  regression.col.first = cov.col.first
  regression.col.last = cov.col.last
  file.list = files
  max.thread = 1
  cate.list = NA

  start.time <- Sys.time()

  if (length(threshold)<=0){
    threshold=0.18 #work in general. The lowest value is 0.03 for the data without outlier
  }

  if (length(min.fst)<=0){
    min.fst=0.0008
  }

  usage= paste0("Usage: ?ipcaps\n")

  if (length(result.dir)==0){
    cat(usage)
    quit()
  }
  if ((length(filename.label)==0 || is.na(file.list)) && (length(rdata.infile)==0) &&
      (length(bed.infile)==0) && is.na(cate.list)){
    cat(usage)
    quit()
  }

  cat(paste0("Running ... IPCAPS \n\toutput: ",result.dir," \n"))

  if (length(filename.label)>0){
    cat(paste0("\tlabel file: ",filename.label,"\n"))
  }

  if (length(label.column)>0){
    if (is.na(as.integer(label.column))){
      #except only comma as separator, otherwise
      if (length(strsplit(label.column,',')[[1]])){
        if (anyNA(as.integer(strsplit(label.column,',')[[1]]))){
          label.column = 1
          cat(paste0("\tlabel column: ",label.column,"\n"))
        }else{
          label.column = as.integer(strsplit(label.column,',')[[1]])
        }
      }else{
        label.column = 1
        cat(paste0("\tlabel column: ",label.column,"\n"))
      }
    }else{
      label.column = as.integer(label.column)
      cat(paste0("\tlabel column: ",label.column,"\n"))
    }
  }else{
    label.column = 1
  }

  if (length(threshold)>0){
    cat(paste0("\tthreshold: ",threshold,"\n"))
  }

  if (length(min.fst)>0){
    cat(paste0("\tminimum Fst: ",min.fst,"\n"))
  }

  if (length(max.thread)<=0){
    max.thread=NA
  }
  if (is.na(max.thread)){
    max.thread = 1
  }else{
    if (max.thread < 1){
      max.thread = 1
    }
  }
  cat(paste0("\tmaximum thread: ",max.thread,"\n"))

  if (length(method)>0){
    cat(paste0("\tmethod: ",method,"\n"))
  }else{
    method="mix"
  }

  if (!is.na(rdata.infile)){
    if (file.exists(rdata.infile)){
      cat(paste0("\trdata: ",rdata.infile,"\n"))
    }else{
      rdata.infile = NA
    }
  }else{
    rdata.infile = NA
  }

  if (!is.na(bed.infile)){
    if (file.exists(bed.infile)){
      cat(paste0("\tbed: ",bed.infile,"\n"))
    }else{
      bed.infile = NA
    }
  }else{
    bed.infile = NA
  }

  if (!is.na(file.list)){
    cat(paste0("\tfiles: \n"))
    print(file.list)
  }

  if (plot.as.pdf){
    cat(paste0("\tplot.as.pdf: TRUE\n"))
    plot.as.pdf=TRUE
  }else{
    plot.as.pdf=FALSE
  }

  if (length(min.in.group)>0){
    if (min.in.group < 5){
      min.in.group = 5
    }
    cat(paste0("\tminimum in group: ",min.in.group,"\n"))
  }else{
    min.in.group=20
  }

  if (length(datatype)>0){
    cat(paste0("\tdata type: ",datatype,"\n"))
  }else{
    datatype="snp"
  }

  if (length(missing.char)>0){
    cat(paste0("\tmissing: ",missing.char,"\n"))
  }else{
    missing.char = NA
  }

  if (length(regression.file)>0 && !is.na(regression.file)){
    cat(paste0("\tcovariate file: ",regression.file,"\n"))
  }else{
    regression.file = NA
  }

  if (!is.na(regression.col.first) && regression.col.first>0){
    cat(paste0("\tcovariate first column: ",regression.col.first,"\n"))
  }else{
    regression.col.first = NA
  }

  if (!is.na(regression.col.last) && regression.col.last>0){
    cat(paste0("\tcovariate last column: ",regression.col.last,"\n"))
  }else{
    regression.col.last = NA
  }


  #preprocessing step
  cat(paste0("In preprocessing step\n"))

  result.dir=preprocess(files=file.list, label.file=filename.label, lab.col=label.column,
                        rdata.infile=rdata.infile, bed.infile=bed.infile, cate.list=cate.list,
                        result.dir=result.dir, threshold=threshold, min.fst=min.fst,
                        reanalysis=rerun, method=method, min.in.group=min.in.group,datatype=datatype,
                        nonlinear=nonlinear, missing.char=missing.char,
                        regression.file=regression.file, regression.col.first=regression.col.first,
                        regression.col.last=regression.col.last,plot.as.pdf=plot.as.pdf,no.plot=no.plot)

  if (is.null(result.dir)){
    return(NULL)
  }

  #job scheduler
  cat(paste0("Start calculating\n"))

  #create the first task for Node 1
  #status 0 = unprocessed, 1 = processing, 2 = done , -1 = deleted

  node = c(1)
  parent.node = c(0)
  status = c(0)

  tree = data.frame(node,parent.node,status)
  file.name = file.path(result.dir,"RData","tree.RData")
  save(tree,file=file.name, compress = 'bzip2')

  while (TRUE){

    file.name = file.path(result.dir,"RData","tree.RData")
    load(file.name)

    #check for terminate loop
    row2 = which(tree$status==2)
    row_1 = which(tree$status==-1)
    if ((length(row2)+length(row_1))==length(tree$node)){
      break
    }

    file.name = file.path(result.dir,"RData","condition.RData")
    load(file=file.name)

    #take one node to process
    which.row = which(tree$status==0)
    if (length(which.row)>0){
      exe.node.list = tree$node[which.row]
      for(idx in exe.node.list) {
        process.each.node(node=idx,work.dir=result.dir)
      }

    }else{
      break
    }

  }

  cat(paste0("In post process step\n"))
  cluster.tab = postprocess(result.dir=result.dir)

  end.time <- Sys.time()
  run.time = as.numeric(end.time-start.time, units = "secs")
  cat(paste0("Total runtime is ",run.time," sec\n"))
  file.name = file.path(result.dir,"RData","runtime.RData")
  save(run.time,file=file.name, compress = 'bzip2')
  cluster.obj <- list("output.dir"=result.dir,"cluster"=cluster.tab)
  file.name = file.path(result.dir,"RData","result.RData")
  save(cluster.obj,file=file.name, compress = 'bzip2')
  return(cluster.obj)
}
# Check the result files in your output directory
# groups.txt contains the assigned groups of samples
# tree_text.html is the text result shown as binary tree
# tree_scatter_cluster.html is the scatter plots colored by clustering result
# tree_scatter_label.html is the scatter plots colored by labels
# tree_scree.html is the scree plots of eigen values