R/gl.blast.r
In dartR.popgen: Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Documented in gl.blast

#' @name gl.blast
#'
#' @title Aligns nucleotides sequences against those present in a target database
#' using blastn
#'
#' @description Basic Local Alignment Search Tool (BLAST; Altschul et al., 1990 &
#'  1997) is a sequence comparison algorithm optimized for speed used to search
#'  sequence databases for optimal local alignments to a query. This function
#'  creates fasta files, creates databases to run BLAST, runs blastn and filters
#'  these results to obtain the best hit per sequence.
#'
#'  This function can be used to run BLAST alignment of short-read (DArTseq
#'  data) and long-read sequences (Illumina, PacBio... etc). You can use
#'  reference genomes from NCBI, genomes from your private collection, contigs,
#'  scaffolds or any other genetic sequence that you would like to use as
#'  reference.
#'
#' @param x Either a genlight object containing a column named
#'  'TrimmedSequence' containing the sequence of the SNPs (the sequence tag)
#'  trimmed of adapters as provided by DArT; or a path to a fasta file with the
#'  query sequences [required].
#' @param ref_genome Path to a reference genome in fasta of fna format
#'  [required].
#' @param task Four different tasks are supported: 1) “megablast”, for very
#'  similar sequences (e.g, sequencing errors), 2) “dc-megablast”, typically
#'  used for inter-species comparisons, 3) “blastn”, the traditional program
#'  used for inter-species comparisons, 4) “blastn-short”, optimized for
#'  sequences less than 30 nucleotides [default 'megablast'].
#' @param Percentage_identity Not a very sensitive or reliable measure of
#'  sequence similarity, however it is a reasonable proxy for evolutionary
#'  distance. The evolutionary distance associated with a 10 percent change in
#'  Percentage_identity is much greater at longer distances. Thus, a change from
#'  80 – 70 percent identity might reflect divergence 200 million years earlier
#'  in time, but the change from 30 percent to 20 percent might correspond to a
#'  billion year divergence time change [default 70].
#' @param Percentage_overlap Calculated as alignment length divided by the
#'  query length or subject length (whichever is shortest of the two lengths,
#'  i.e.  length / min(qlen,slen) ) [default 0.8].
#' @param bitscore A rule-of-thumb for inferring homology, a bit score of 50
#'  is almost always significant [default 50].
#' @param number_of_threads Number of threads (CPUs) to use in blastn search
#'  [default 2].
#' @param verbose verbose= 0, silent or fatal errors; 1, begin and end; 2,
#' progress log ; 3, progress and results summary; 5, full report
#' [default 2 or as specified using gl.set.verbosity]
#'
#' @details \strong{Installing BLAST}
#'
#'  You can download the BLAST installs from:
#'  \url{https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/}
#'
#'  It is important to install BLAST in a path that does not contain spaces for
#'  this function to work.
#'
#'  \strong{Running BLAST}
#'
#'  Four different tasks are supported: \itemize{ \item “megablast”, for very
#'  similar sequences (e.g, sequencing errors) \item “dc-megablast”, typically
#'  used for inter-species comparisons \item “blastn”, the traditional program
#'  used for inter-species comparisons \item “blastn-short”, optimized for
#'  sequences less than 30 nucleotides }
#'
#'  If  you  are  running  a  BLAST alignment of  similar  sequences,  for
#'  example  Turtle  Genome  Vs Turtle Sequences, the recommended parameters
#'  are: task = “megablast”, Percentage_identity = 70, Percentage_overlap =  0.8
#'  and bitscore = 50.
#'
#'  If you are running a BLAST alignment of highly dissimilar sequences because
#'  you are probably looking for sex linked  hits in  a distantly  related
#'  species,  and  you  are aligning for example sequences of Chicken Genome Vs
#'  Bassiana, the recommended parameters are: task = “dc-megablast”,
#'  Percentage_identity = 50, Percentage_overlap =  0.01 and bitscore = 30.
#'
#'  Be aware that running BLAST might take a long time (i.e. days) depending of
#'  the size of your query, the size of your database and the number of threads
#'  selected for your computer.
#'
#'  \strong{BLAST output}
#'
#'  The BLAST output is formatted as a table using output format 6, with columns
#'  defined in the following order: \itemize{ \item qseqid - Query Seq-id \item
#'  sacc - Subject accession \item stitle - Subject Title \item qseq - Aligned
#'  part of query sequence \item sseq - Aligned part of subject sequence \item
#'  nident - Number of identical matches \item mismatch - Number of mismatches
#'  \item pident - Percentage of identical matches \item length - Alignment
#'  length \item evalue - Expect value \item bitscore - Bit score \item qstart -
#'  Start of alignment in query \item qend - End of alignment in query \item
#'  sstart - Start of alignment in subject \item send - End of alignment in
#'  subject \item gapopen - Number of gap openings \item gaps - Total number of
#'  gaps \item qlen - Query sequence length \item slen - Subject sequence length
#'  \item PercentageOverlap - length / min(qlen,slen) }
#'
#'  Databases containing unfiltered aligned sequences, filtered aligned
#'  sequences and one hit per sequence are saved to the working directory
#'  (plot.dir tempdir if not set).
#'
#'
#'  \strong{BLAST filtering}
#'
#'  BLAST output is filtered by ordering the hits of each sequence first by the
#'  highest percentage identity, then the highest percentage overlap and then
#'  the highest bitscore. Only one hit per sequence is kept based on these
#'  selection criteria.
#'
#' @return If the input is a genlight object: returns a genlight object with one
#'  hit per sequence merged to the slot $other$loc.metrics. If the input is a
#'  fasta file: returns a dataframe with one hit per sequence.
#'
#' @author Berenice Talamantes Becerra & Luis Mijangos (Post to
#'  \url{https://groups.google.com/d/forum/dartr})
#'
#' @examples
#' \dontrun{
#' res <- gl.blast(x = testset.gl, ref_genome = "sequence.fasta")
#' # display of reports saved in the temporal directory
#' # open the reports saved in the temporal directory
#' }
#'
#' @references
#' \itemize{
#' \item Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D.
#'  J. (1990). Basic local alignment search tool. Journal of molecular biology,
#'  215(3), 403-410.
#' \item Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang,
#'  Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new
#'  generation of protein database search programs. Nucleic acids research,
#'  25(17), 3389-3402.
#' \item Pearson, W. R. (2013). An introduction to sequence similarity
#'  (“homology”) searching. Current protocols in bioinformatics, 42(1), 3-1.
#'  }
#'
#' @seealso \code{\link[dartR.base]{gl.print.history}}
#'
#' @family reference genomes
#'
#' @export
#'

gl.blast <- function(x,
                     ref_genome,
                     task = "megablast",
                     Percentage_identity = 70,
                     Percentage_overlap = 0.8,
                     bitscore = 50,
                     number_of_threads = 2,
                     verbose = NULL) {
  # SET VERBOSITY
  verbose <- gl.check.verbosity(verbose)
  
  # FLAG SCRIPT START
  funname <- match.call()[[1]]
  utils.flag.start(func = funname,
                   build = "Jody",
                   verbose = verbose)
  
  # Check if the x@other$loc.metrics$TrimmedSequence slot exists
  if (class(x)[1] == "genlight"| class(x)[1] == "dartR" ) {
    if (is.null(x$other$loc.metrics$TrimmedSequence)) {
      stop(error(
        "\n\nFatal Error: TrimmedSequence column is required!.\n\n"
      ))
    }
  }
  
  # DO THE JOB
  
  # getting the query fasta files
  if (class(x)[1] == "genlight" | class(x)[1] == "dartR") {
    fasta.input <-
      c(rbind(
        paste("> ", 1:nLoc(x)),
        as.character(x$other$loc.metrics$TrimmedSequence)
      ))
    write.table(
      fasta.input,
      file = paste0(tempdir(), "/fasta.input"),
      quote = FALSE,
      row.names = FALSE,
      col.names = FALSE
    )
  } else {
    file.copy(
      from = x,
      to = paste0(tempdir(), "/fasta.input"),
      overwrite = TRUE
    )
  }
  
  # Find executable makeblastdb if unix
  if (grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
    path_makeblastdb <- Sys.which("makeblastdb")
  }
  ## if windows
  if (!grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
    path_makeblastdb <-
      tryCatch(
        system(sprintf("where %s", "makeblastdb"), intern = TRUE)[1],
        warning = function(w)
          "",
        error = function(e)
          ""
      )
    if (grepl("\\s", path_makeblastdb)) {
      stop(
        error(
          "  The path to the executable for makeblastdb has spaces. 
              Please move it\n to a path without spaces so BLAST can work.\n\n"
        )
      )
    }
  }
  
  ## if not found
  if (all(path_makeblastdb == "")) {
    stop(
      error(
        "  Executable for makeblastdb not found! Please make sure that 
                the software\n is correctly installed.\n\n"
      )
    )
  }
  
  # creating BLAST databases
  system(paste(
    path_makeblastdb,
    "-in",
    ref_genome,
    "-dbtype",
    "nucl",
    "-out",
    paste0(tempdir(), "/db_blast")
  ))
  
  # Find executable blastn if unix
  if (grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
    path_blastn <- Sys.which("blastn")
  }
  ## if windows
  if (!grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
    path_blastn <-
      tryCatch(
        system(sprintf("where %s", "blastn"), intern = TRUE)[1],
        warning = function(w)
          "",
        error = function(e)
          ""
      )
    if (grepl("\\s", path_blastn)) {
      stop(
        error(
          "The path to the executable for blastn has spaces. Please 
                    move it to a\n path without spaces so BLAST can work.\n\n"
        )
      )
    }
  }
  ## if not found
  if (all(path_blastn == "")) {
    stop(
      error(
        "Executable for blastn not found! Please make sure that the 
                software is\n correctly installed.\n\n"
      )
    )
  }
  
  if (verbose >= 1) {
    cat(report("Starting BLASTing\n\n"))
  }
  
  # BLASTing
  system(
    paste(
      path_blastn,
      " -task ",
      task,
      " -db ",
      paste0(tempdir(), "/db_blast"),
      " -query ",
      paste0(tempdir(), "/fasta.input"),
      " -out ",
      paste0(tempdir(), "/output_blast.txt"),
      " -perc_identity ",
      Percentage_identity,
      " -num_threads ",
      number_of_threads,
      " -outfmt ",
      "\"",
      6,
      " qseqid sacc stitle qseq sseq nident mismatch pident length evalue 
            bitscore qstart qend sstart send gapopen gaps qlen slen\""
    )
  )
  
  file_size <- file.info(paste0(tempdir(), "/output_blast.txt"))$size
  
  # reading file for filtering
  if (file_size > 0) {
    blast_res_unfiltered <-
      read.table(
        file = paste0(tempdir(), "/output_blast.txt"),
        header = FALSE,
        sep = "\t",
        quote = "\"",
        dec = ".",
        fill = TRUE,
        comment.char = "",
        stringsAsFactors = FALSE
      )
  } else {
    cat(report("No sequences were aligned\n\n"))
    return(x)
  }
  
  if (verbose >= 1) {
    cat(report("Starting filtering\n\n"))
  }
  
  colnames(blast_res_unfiltered) <-
    c(
      "qseqid",
      "sacc",
      "stitle",
      "qseq",
      "sseq",
      "nident",
      "mismatch",
      "pident",
      "length",
      "evalue",
      "bitscore",
      "qstart",
      "qend",
      "sstart",
      "send",
      "gapopen",
      "gaps",
      "qlen",
      "slen"
    )
  
  # calculate percentage overlap ratio of the alignment length divided by the
  # query length or subject length (whichever is shortest of
  # the two lengths)
  min_length_BF <-
    apply(blast_res_unfiltered[, c("qlen", "slen")], 1, min)
  blast_res_unfiltered$PercentageOverlap <-
    blast_res_unfiltered$length / min_length_BF
  # filtering first by percentage overlap and bitscore
  blast_res_filtered <-
    blast_res_unfiltered[which(
      blast_res_unfiltered$PercentageOverlap >= Percentage_overlap &
        blast_res_unfiltered$bitscore >=
        bitscore
    ),]
  # splitting hits by sequence
  all_hits <-
    split(x = blast_res_filtered, f = blast_res_filtered$qseqid)
  # ordering by first considering the highest percentage identity, then the 
  # highest percentage overlap, then the highest bitscore. Only
  # one hit per sequence is kept based on these selection criteria.
  one_hit_temp <- lapply(all_hits, function(x) {
    x[order(x$pident,
            x$PercentageOverlap,
            x$bitscore,
            decreasing = T),][1,]
  })
  
  one_hit <- plyr::rbind.fill(one_hit_temp)
  # merging one hit per sequence with genlight object
  if (class(x)[1] == "genlight"| class(x)[1] == "dartR"  ) {
    one_hit_temp <- x$other$loc.metrics
    one_hit_temp$qseqid <- 1:nLoc(x)
    if (!is.null(one_hit)) {
      x$other$loc.metrics <-
        merge(one_hit_temp,
              one_hit,
              by = "qseqid",
              all = T)
    }
  }
  
  if (verbose >= 1) {
    cat(report(paste(
      nrow(one_hit), " sequences were aligned after filtering"
    )))
  }
  
  match_call <-
    paste0(names(match.call()),
           "_",
           as.character(match.call()),
           collapse = "_")
  
  # creating file names
  temp_blast_unfiltered <- tempfile(pattern = "Blast_unfiltered_")
  temp_blast_filtered <- tempfile(pattern = "Blast_filtered_")
  temp_one_hit <- tempfile(pattern = "Blast_one_hit_")
  
  # saving to tempdir
  saveRDS(list(match_call, blast_res_unfiltered), 
          file = temp_blast_unfiltered)
  saveRDS(list(match_call, blast_res_filtered), file = temp_blast_filtered)
  saveRDS(list(match_call, one_hit), file = temp_one_hit)
  
  if (verbose >= 2) {
    cat(
      report(
        "  NOTE: Retrieve output files from tempdir using 
                gl.list.reports() and gl.print.reports()\n"
      )
    )
  }
  
  # ADD TO HISTORY
  if (class(x)[1] == "genlight" | class(x)[1] == "dartR" ) {
    nh <- length(x@other$history)
    x@other$history[[nh + 1]] <- match.call()
  }
  
  # FLAG SCRIPT END
  if (verbose >= 1) {
    cat(report("Completed:", funname, "\n\n"))
  }
  
  if (class(x)[1] == "genlight" | class(x)[1] == "dartR") {
    return(x)
  } else {
    return(one_hit)
  }
  
}
Any scripts or data that you put into this service are public.
dartR.popgen documentation built on June 28, 2024, 1:06 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
dartR.popgen
Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.blast.r
In dartR.popgen: Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Defines functions gl.blast

Documented in gl.blast

Try the dartR.popgen package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dartR.popgen Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.blast.r In dartR.popgen: Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Defines functions gl.blast

Documented in gl.blast

Try the dartR.popgen package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dartR.popgen
Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.blast.r
In dartR.popgen: Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis