R/workflowPartialMatch.R

Defines functions workflowPartialMatch

Documented in workflowPartialMatch

#' Finds the section in a long sequence that better matches a short sequence.
#'
#' @description This workflow works under the following scenario: the user has a short sequence, and a long sequence, and has the objective of finding the segment in the long sequence that better matches the short sequence. The function identifies automatically the short and the long sequence, but throws an error if more than two sequences are introduced. The lengths of the segments in the long sequence to be compared with the long sequence are defined through the arguments \code{min.length} and \code{max.length}. If left empty, \code{min.length} and \code{max.length} equal 0, meaning that the segment to be searched for will have the same number of cases as the short sequence. Note that this is a brute force algorithm, can have a large memory footpring if the interval between \code{min.length} and \code{max.length} is too long. It might be convenient to pre-check the number of iterations to be performed by computing \code{sum(nrow(long.sequence) - min.length:max.length) + 1}. The algorithm is parallelized and optimized as possible, so still, large searches are possible.

#'
#' @usage workflowPartialMatch(
#'   sequences = NULL,
#'   grouping.column = NULL,
#'   time.column = NULL,
#'   exclude.columns = NULL,
#'   method = "manhattan",
#'   diagonal = FALSE,
#'   paired.samples = FALSE,
#'   min.length = NULL,
#'   max.length = NULL,
#'   ignore.blocks = FALSE,
#'   parallel.execution = TRUE
#'   )
#'
#' @param sequences dataframe with multiple sequences identified by a grouping column generated by \code{\link{prepareSequences}}.
#' @param grouping.column character string, name of the column in \code{sequences} to be used to identify separates sequences within the file.
#' @param time.column character string, name of the column with time/depth/rank data.
#' @param exclude.columns character string or character vector with column names in \code{sequences} to be excluded from the analysis.
#' @param method character string naming a distance metric. Valid entries are: "manhattan", "euclidean", "chi", and "hellinger". Invalid entries will throw an error.
#' @param diagonal boolean, if \code{TRUE} (default), diagonals are included in the computation of the least cost path. This is the best option if the user suspects that a given segment in the short sequence might be identical to the short sequence.
#' @param paired.samples boolean, if \code{TRUE}, the sequences are assumed to be aligned, and distances are computed for paired-samples only (no distance matrix required). Default value is \code{FALSE}.
#' @param min.length integer, minimum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If \code{NULL} (default), the subset of the long sequence to be matched will thave the same number of samples as the short sequence.
#' @param max.length  integer, maximum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If \code{NULL} (default), the subset of the long sequence to be matched will thave the same number of samples as the short sequence.
#' @param ignore.blocks boolean. If \code{TRUE}, the function \code{\link{leastCostPathNoBlocks}} analyzes the least-cost path of the best solution, and removes blocks (straight-orthogonal sections of the least-cost path), which happen in highly dissimilar sections of the sequences, and inflate output psi values.
#' @param parallel.execution boolean, if \code{TRUE} (default), execution is parallelized, and serialized if \code{FALSE}.
#'
#' @return A dataframe with three columns:
#' \itemize{
#' \item \emph{first.row} first row of the segment in the long sequence matched against the short one.
#' \item \emph{last.row} last row of the segment in the long sequence matched against the short one.
#' \item \emph{psi} psi values, ordered from lower (máximum similarity / minimum dissimilarity) to higher.
#' }
#'
#' @author Blas Benito <blasbenito@gmail.com>
#'
#' @examples
#'
#' \donttest{
#'
#'#loading the data
#'data(sequencesMIS)
#'
#'#removing grouping column
#'sequencesMIS$MIS <- NULL
#'
#'#mock-up short sequence
#'MIS.short <- sequencesMIS[1:10, ]
#'
#'#mock-up long sequence
#'MIS.long <- sequencesMIS[1:30, ]
#'
#'#preparing sequences
#'MIS.sequences <- prepareSequences(
#'  sequence.A = MIS.short,
#'  sequence.A.name = "short",
#'  sequence.B = MIS.long,
#'  sequence.B.name = "long",
#'  grouping.column = "id",
#'  transformation = "hellinger"
#'  )
#'
#'#matching sequences
#'#min.length and max.length are
#'#minimal to speed up execution
#'MIS.psi <- workflowPartialMatch(
#'  sequences = MIS.sequences,
#'  grouping.column = "id",
#'  time.column = NULL,
#'  exclude.columns = NULL,
#'  method = "manhattan",
#'  diagonal = FALSE,
#'  parallel.execution = FALSE
#'  )
#'
#'#output dataframe
#'MIS.psi
#'
#'}
#'
#' @export
workflowPartialMatch <- function(
  sequences = NULL,
  grouping.column = NULL,
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  paired.samples = FALSE,
  min.length = NULL,
  max.length = NULL,
  ignore.blocks = FALSE,
  parallel.execution = TRUE
){

  #1 TESTING NUMBER OF GROUPS
  if(is.null(grouping.column)){
    if("id" %in% colnames(sequences)){
      grouping.column = "id" #default behavior of prepareSequences
    } else {
      stop("The argument 'grouping.column' must be a column name of the 'sequences' dataset.")
    }
  }
  if(!(grouping.column %in% colnames(sequences))){
    stop("The argument 'grouping.column' is not a column of the 'sequences' dataset.")
  }
  if(length(unique(sequences[, grouping.column])) > 2){
    stop("More than two groups were found in 'grouping.column'. This workflow only works with two sequences!")
  }


  #2 COMPUTING DISTANCE MATRIX
  if(is.null(paired.samples) | paired.samples == FALSE){

  distance.matrix <- distanceMatrix(
      sequences = sequences,
      grouping.column = grouping.column,
      time.column = time.column,
      exclude.columns = exclude.columns,
      method = method,
      parallel.execution = FALSE
    )

  } else {

    #creates dummy distance.matrix object, because it needs to be passed as a variable to a cluster
    distance.matrix <- NULL
    diagonal <- FALSE
    min.length <- NULL
    max.length <- NULL

  }

  #3. PREPARING COMBINATIONS OF SAMPLES OF THE LONG SEQUENCE
  #guessing names of the short and long sequences
  table.groups <- table(sequences[, grouping.column])
  sequences.long.name <- names(table.groups)[which.max(table.groups)]
  sequences.short.name <- names(table.groups)[which.min(table.groups)]

  #minimum and maximum length of the matching sequences
  if((is.null(min.length) & is.null(max.length)) | paired.samples == TRUE){
    min.length <- max.length <- table.groups[sequences.short.name]
    names(min.length) <- names(max.length) <- NULL
  } else {
    min.length <- floor(min(c(min.length, max.length)))
    max.length <- floor(max(c(min.length, max.length)))
  }

  #vector of lengths (sizes of subsets in the long sequence)
  target.lengths <- min.length:max.length

  #number of combinations
  # n.combinations <- sum(table.groups[sequences.long.name] - target.lengths) + 1

  #subsetting the sequences
  sequences.long <- sequences[sequences[, grouping.column] == sequences.long.name, ]
  sequences.short <- sequences[sequences[, grouping.column] == sequences.short.name, ]

  #nrow of sequences.long
  sequences.long.nrow <- nrow(sequences.long)

  #indexes
  first.row <- last.row <- vector()
  vectors.i <- 0

  #iterating through lengths
  for(target.length.i in target.lengths){

    #iterating through rows in the long sequences
    for(target.row.i in 1:(sequences.long.nrow - target.length.i)){

      #advancing one position in the indexes
      vectors.i <- vectors.i + 1

      #getting first row of the subset
      first.row[vectors.i] <- target.row.i

      #getting last row of the subset
      last.row[vectors.i] <- min(c(target.row.i + (target.length.i - 1), sequences.long.nrow))

    }
  }
  if(length(first.row) != length(last.row)){
    stop("Something went wrong with the subsetting...")
  } else {
      n.iterations <- length(first.row)
  }


  #5 starting cluster
  #parallel execution = TRUE
  if(parallel.execution == TRUE){
    `%dopar%` <- foreach::`%dopar%`
    n.cores <- parallel::detectCores() - 1
    if(n.iterations < n.cores){n.cores <- n.iterations}
    my.cluster <- parallel::makeCluster(n.cores, type="FORK")
    doParallel::registerDoParallel(my.cluster)

    #exporting cluster variables
    parallel::clusterExport(cl=my.cluster,
                            varlist=c('distance.matrix',
                                      'sequences.long',
                                      'sequences.long.nrow',
                                      'first.row',
                                      'last.row',
                                      'sequences.short',
                                      'sequences.long.name',
                                      'sequences.short.name',
                                      'paired.samples',
                                      'diagonal',
                                      'grouping.column',
                                      'time.column',
                                      'method',
                                      'distancePairedSamples',
                                      'ignore.blocks'),
                            envir=environment()
    )
  } else {
    #replaces dopar (parallel) by do (serial)
    `%dopar%` <- foreach::`%do%`
    on.exit(`%dopar%` <- foreach::`%dopar%`)
  }

  #parallelized loop
  psi.values <- foreach::foreach(i = 1:n.iterations) %dopar% {

    #subset rows
    subset.rows <- first.row[i]:last.row[i]

    #SAMPLES ARE NOT PAIRED: ELASTIC METHOD
    if(paired.samples == FALSE){

      #subsetting distance matrix
      distance.matrix.subset <- list()
      distance.matrix.subset[[1]] <- distance.matrix[[1]][, subset.rows]
      names(distance.matrix.subset) <- names(distance.matrix)

      #computing least cost matrix
      least.cost.matrix <- leastCostMatrix(
        distance.matrix = distance.matrix.subset,
        diagonal = diagonal,
        parallel.execution = FALSE
      )

      #computing least cost path
      least.cost.path <- leastCostPath(
        distance.matrix = distance.matrix.subset,
        least.cost.matrix = least.cost.matrix,
        diagonal = diagonal,
        parallel.execution = FALSE
      )

      #BLOCKS ARE NOT IGNORED
      if(ignore.blocks == FALSE){

        least.cost <- leastCost(
          least.cost.path = least.cost.path,
          parallel.execution = FALSE
        )

      } else {

        #BLOCKS ARE IGNORED
        #computing least cost path
        least.cost.path <- leastCostPathNoBlocks(
          least.cost.path = least.cost.path,
          parallel.execution = FALSE
        )

        #getting least cost ignoring blocks
        least.cost <- leastCost(
          least.cost.path = least.cost.path,
          parallel.execution = FALSE
        )

      }

      #autosum
      autosum.sequences <- autoSum(
        sequences = sequences,
        least.cost.path = least.cost.path,
        grouping.column = grouping.column,
        time.column = time.column,
        exclude.columns = exclude.columns,
        method = method,
        parallel.execution = FALSE
      )

      #computing psi
      psi.value <- psi(
        least.cost = least.cost,
        autosum = autosum.sequences,
        parallel.execution = FALSE
      )

      #shifting psi by 1
      if(diagonal == TRUE){
        psi.value <- lapply(X = psi.value, FUN = function(x){x + 1})
      }

    } #end of paired.samples == FALSE

    #if paired samples is TRUE
    if(paired.samples == TRUE){

      least.cost <- distancePairedSamples(
        sequences = rbind(sequences.short, sequences.long[subset.rows, ]),
        grouping.column = grouping.column,
        time.column = time.column,
        exclude.columns = exclude.columns,
        method = method,
        sum.distances = TRUE,
        parallel.execution = FALSE
      )

      #autosum
      autosum.sequences <- autoSum(
        sequences = rbind(sequences.short, sequences.long[subset.rows, ]),
        least.cost.path = least.cost,
        grouping.column = grouping.column,
        time.column = time.column,
        exclude.columns = exclude.columns,
        method = method,
        parallel.execution = FALSE
      )

      #computing psi
      psi.value <- psi(
        least.cost = least.cost,
        autosum = autosum.sequences,
        parallel.execution = parallel.execution
      )

      #shifting psi by 1
      psi.value <- lapply(X = psi.value, FUN = function(x){x + 1})

    } #end of paired.samples == TRUE

    return(psi.value[[1]])

  } #end of parallelized loop

  #stopping cluster
  if(parallel.execution == TRUE){
    parallel::stopCluster(my.cluster)
  } else {
    #creating the correct alias again
    `%dopar%` <- foreach::`%dopar%`
  }

  #getting output dataframe ready
  output.df <- data.frame(first.row = first.row, last.row = last.row, psi = unlist(psi.values))

  #ordering
  output.df <- output.df[order(output.df$psi, decreasing = FALSE), ]
  rownames(output.df) <- 1:nrow(output.df)

  return(output.df)

}

Try the distantia package in your browser

Any scripts or data that you put into this service are public.

distantia documentation built on Oct. 30, 2019, 10:05 a.m.