R/meta.summarize.R

#' @title Summarize (concatenate) all predictions of a \code{LTRpred.meta} run
#' @description Crawl through all genome predictions performed with \code{\link{LTRpred.meta}}
#' and concatenate the prediction files for each species in the meta result folder
#' generated by \code{\link{LTRpred.meta}} to a meta-species \code{data.frame}.
#' @param result.folder path to meta result folder generated by \code{\link{LTRpred.meta}}.
#' @param ltr.similarity only retain elements that have an LTR similarity >= this threshold.
#' @param quality.filter logical; if \code{TRUE}, apply the quality filter to remove potential false positives (e.g. duplicated genes). See \code{Details} for further information on the filter criteria.
#' @param n.orfs minimum number of Open Reading Frames that must be found between the LTRs (if \code{quality.filter = TRUE}). See \code{Details} for further information on quality control.
#' @param strategy quality filter strategy. Options are
#' \itemize{
#' \item \code{strategy = "default"} : see section \code{Quality Filtering}
#' \item \code{strategy = "stringent"} : in addition to the filter criteria specified in section \code{Quality Filtering},
#' the filter criterion \code{(!is.na(protein_domain)) | (dfam_target_name != "unknown")} is applied
#' }
#' @author Hajk-Georg Drost
#' @details
#' This function crawls through each genome stored in the meta result folder
#' generated by \code{\link{LTRpred.meta}} and performs the following procedures:
#'
#' \itemize{
#' \item \strong{Step 1:} For each genome: Read the \code{*_LTRpred_DataSheet.tsv} file generated by \code{\link{LTRpred}}.
#' \item \strong{Step 2:} For each genome: If \code{quality.filter = TRUE}, perform quality filtering (see \code{Quality Filtering}), including the selection of elements having at least \code{ltr.similarity} sequence similarity between their LTRs. Otherwise only the \code{ltr.similarity} threshold is applied.
#' \item \strong{Step 3:} Summarize all genome predictions in the meta-folder to one meta-species \code{data.frame}.
#' }
#'
#' \strong{Quality Filtering}
#'
#' The aim of the quality filtering step is to reduce the number of potential false positive
#' LTR transposons predicted by \code{\link{LTRpred}}. These false positives can be
#' duplicated genes or other homologous repetitive elements that fulfill the LTR similarity
#' criterion, but do not have any Primer Binding Site, Open Reading Frames, Gag and Pol
#' proteins, etc. To discard such false positives, the following filters are applied.
#'
#' \itemize{
#' \item \code{ltr.similarity}: minimum similarity between LTRs. All TEs not matching this
#' criterion are discarded.
#' \item \code{n.orfs}: minimum number of Open Reading Frames that must be found between the
#' LTRs. All TEs not matching this criterion are discarded.
#' \item \code{PBS or Protein Match}: elements must either have a predicted Primer Binding
#' Site or a protein match of at least one protein (Gag, Pol, Rve, ...) between their LTRs. All TEs not matching this criterion are discarded.
#' }
#' @return a \code{LTRpred.tbl} storing the \code{\link{LTRpred}} prediction \code{data.frames} for all species in the meta result folder generated by \code{\link{LTRpred.meta}}.
#'
#'
#' @export

meta.summarize <- function(result.folder, 
                           ltr.similarity = 70, 
                           quality.filter = TRUE,
                           n.orfs         = 0,
                           strategy = "default"){
  
    result.files <- list.files(result.folder)
    folders0 <-
        result.files[stringr::str_detect(result.files, "ltrpred")]
    org.list <- vector("list", length(folders0))
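    # Bind the column name used in dplyr::filter() below to NULL so that
    # R CMD check does not report it as an undefined global variable.
    ltr_similarity <- NULL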
    
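    # seq_along() yields an empty sequence when no "ltrpred" folders exist,
    # whereas 1:length(folders0) would iterate over c(1, 0).
    for (i in seq_along(folders0)) {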
        choppedFolder <- unlist(stringr::str_split(folders0[i], "_"))
        pred <- read.ltrpred(file.path(
            result.folder,
            folders0[i],
            paste0(
                paste0(choppedFolder[-length(choppedFolder)], collapse = "_"),
                "_LTRpred_DataSheet.tsv"
            )
        ))
        
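        if (quality.filter) {
            # The call below resolves to the function quality.filter(),
            # not to the logical argument of the same name.
            pred.filtered <-
                quality.filter(pred,
                               sim = ltr.similarity,
                               n.orfs = n.orfs,
                               strategy = strategy)
        } else {
            # Without the quality filter, only the similarity threshold applies.
            message("No quality filter was applied ...")
            pred.filtered <-
                dplyr::filter(pred, ltr_similarity >= ltr.similarity)
        }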
        
        if (nrow(pred.filtered) > 0) {
            org.list[i] <-
                list(as.data.frame(pred.filtered))
        } else {
            warning("After filtering with sim = ", ltr.similarity, " in ",
                    stringr::str_replace(folders0[i], "_ltrpred", ""),
                    ", no entries remained.")
        }
    } 
    
    return(dplyr::bind_rows(org.list))
}
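
The data sheet lookup and the filter criteria documented above can be illustrated on a toy table. The following is a minimal base R sketch, not the package's own `quality.filter()`, whose implementation is not shown here and may differ. The helper names `datasheet_path()` and `toy_quality_filter()`, the folder name `Athaliana_ltrpred`, and the columns `ID`, `n_orfs` and `has_pbs` are hypothetical; only `ltr_similarity`, `protein_domain` and `dfam_target_name` appear in the source above.

```r
# Hypothetical helper: derive the data sheet path from a result sub-folder
# name the same way the loop in meta.summarize() does, i.e. drop the last
# "_"-separated token and append the "_LTRpred_DataSheet.tsv" suffix.
datasheet_path <- function(result.folder, folder) {
    tokens <- unlist(strsplit(folder, "_", fixed = TRUE))
    prefix <- paste0(tokens[-length(tokens)], collapse = "_")
    file.path(result.folder, folder, paste0(prefix, "_LTRpred_DataSheet.tsv"))
}

# Hypothetical stand-in for the documented filter criteria (an assumption,
# not the real quality.filter()):
#   1. LTR similarity >= sim
#   2. at least n.orfs Open Reading Frames
#   3. a Primer Binding Site OR a protein match
# strategy = "stringent" additionally requires a protein domain or a
# Dfam target name other than "unknown".
toy_quality_filter <- function(pred, sim = 70, n.orfs = 0, strategy = "default") {
    keep <- pred$ltr_similarity >= sim &
        pred$n_orfs >= n.orfs &
        (pred$has_pbs | !is.na(pred$protein_domain))
    if (strategy == "stringent") {
        keep <- keep &
            ((!is.na(pred$protein_domain)) | (pred$dfam_target_name != "unknown"))
    }
    pred[keep, , drop = FALSE]
}

# Made-up predictions for illustration only.
toy_pred <- data.frame(
    ID               = c("te1", "te2", "te3", "te4"),
    ltr_similarity   = c(95, 65, 88, 99),
    n_orfs           = c(2, 1, 0, 1),      # hypothetical column
    has_pbs          = c(TRUE, TRUE, FALSE, TRUE),  # hypothetical column
    protein_domain   = c("RVT_1", NA, NA, NA),
    dfam_target_name = c("Copia", "unknown", "unknown", "unknown"),
    stringsAsFactors = FALSE
)

default_hits   <- toy_quality_filter(toy_pred, sim = 70, n.orfs = 0)
stringent_hits <- toy_quality_filter(toy_pred, sim = 70, n.orfs = 0,
                                     strategy = "stringent")
path <- datasheet_path("results", "Athaliana_ltrpred")

default_hits$ID    # "te1" "te4": te2 fails similarity, te3 has no PBS or protein match
stringent_hits$ID  # "te1": te4 has no protein domain and an unknown Dfam target
path               # "results/Athaliana_ltrpred/Athaliana_LTRpred_DataSheet.tsv"
```

In this sketch, the `"default"` strategy keeps any element with a Primer Binding Site, while `"stringent"` also demands annotation evidence, which is why `te4` survives only the default filter.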
HajkD/LTRpred documentation built on April 22, 2022, 4:35 p.m.