roe: R utilities for alevin-fry

Documented in make_splici_txome

#' Construct splici transcriptome
#'
#' Construct the splici transcriptome for alevin-fry.
#'
#' @param genome_path Character scalar providing the path to a genome fasta
#'   file.
#' @param gtf_path Character scalar providing the path to a gtf file.
#' @param read_length Numeric scalar indicating the read length.
#' @param flank_trim_length Numeric scalar giving the flank trimming length. The
#'   final flank length is obtained by subtracting the \code{flank_trim_length}
#'   from the \code{read_length}.
#' @param output_dir Character scalar indicating the output directory, where
#'   the splici transcriptome files will be written.
#' @param filename_prefix Character scalar giving the file name prefix of the
#'   generated output files. The derived flank length will be automatically
#'   appended to the provided prefix.
#' @param extra_spliced Character scalar providing the path to a fasta file
#'   with additional sequences to include among the spliced ones.
#' @param extra_unspliced Character scalar providing the path to a fasta file
#'   with additional sequences to include among the unspliced ones.
#' @param dedup_seqs Logical scalar indicating whether or not to remove
#'   duplicate sequences.#' 
#' @param no_flanking_merge Logical scalar indicating whether or not to 
#'   merge overlapping introns caused by adding flanking length.
#' @param quiet logical whether to display no messages
#'
#' @author Dongze He
#'
#'
#' @details
#' 
#' The term \emph{splici} is shorthand for spliced + intronic reference sequence.   
#' This reference sequence is prepared by extracting the spliced transcripts from the reference genome according to  the desired annotation (e.g. unfiltered, or filtered for certain classes of transcripts),as well as the collapsed intervals corresponding to the introns of genes. 
#' 
#' The intronic sequences of the \emph{splici} reference play important roles in the various kinds of experiments discussed in the manuscript of \code{alevin-fry}.
#' For single-cell RNA-seq data, although one typically focuses on the  fragments and UMIs arising from the (spliced) transcriptome,and only considers the spliced and ambiguous counts when performing downstream analyses, the intronic sequences act similarly to decoy sequences proposed by Srivastava 2020 et al. 
#' They account for fragments that might otherwise selectively-align or pseudoalign to the transcriptome with lower quality, but that, in fact, derive from some unspliced RNA transcript molecule. 
#' \emph{splici} improves the accuracy and false discovery of mapping, as the intronic sequences of \code{splici} act as the decoy sequence to absorb reads map to the spliced transcriptome with low quality.
#' For single-nucleus RNA-seq and RNA velocity analysis, the usage is straightforward, as the intronic sequences enable \code{alevin-fry} to detect the signals from unspliced, immature RNA transcripts, which are crucial in those analyses.
#' 
#' To construct the \emph{splici} reference, one provides a genome fasta file and a gene annotation GTF file of the  target species, as well as the read length of the experiments  to be processed. 
#' Here we describe the procedure to extract the  \emph{splici} sequence of one gene; the procedure will be applied for  all genes specified in the provided GTF file.   
#' First, the spliced transcript sequences are extracted for all isoforms of the gene. 
#' Next, the intronic regions of each isoform of  the gene are extracted.  
#' These intronic regions consist of the  full length of the unspliced transcript, subtracting out any  exonic sequence.  
#' If alternative \emph{splicing} positions occur within  exons, certain sub-intervals of a gene can appear as part of both  spliced transcripts and as intronic sequence. 
#' Additionally, each  extracted intronic interval is extended by the provided \emph{flanking length} at its starting and ending position. 
#' By adding this \emph{flanking  length}, the reference can account for reads that map to the junction  of an exon and an intron.  
#' Otherwise these reads cannot be confidently  mapped to either the unflanked intronic region or to the spliced  isoforms of the gene. 
#' 
#' The intronic regions of different isoforms often overlap considerably.   
#' To avoid repetitive indexing of shared sub-intervals, the flanked  intronic regions across all isoforms of a gene will be combined if  they share overlapping sub-regions.  
#' As the result, each unique  genomic sequence will appear only once in the combined intronic  regions of the gene.  
#' The genomic sequences of those combined  intronic regions will then be extracted as the intronic part of this gene. 
#' These combined intronic sequences will each be given  a unique name, and all will be added to the \emph{splici}  reference.
#' 
#' The process described above will be applied to all genes  defined in the provided GTF file.  
#' These sequences will be  combined with any custom (user-provided) sequences, for example, one can include the sequence of  mitochondrial genes to avoid spurious mapping further. 
#' The resulting sequences constitute the \emph{splici} reference. 
#' Note that sometimes one genomic region may belong to more than one gene. 
#' In this case, extracted sequences can appear multiple times in \code{splici} reference with different names. 
#' As duplicated sequences will anyway appear only once in the \code{salmon} index except if one specifically sets the \code{-{}-keepDuplicates} flag (which we have not), we did not explicitly deduplicate repeated sequences across genes when constructing the \emph{splici} reference.  
#' However, this function accepts an optional argument, \code{dedup_seqs} that will perform this identical sequence deduplication during reference construction. 
#' 
#' Currently, the \emph{splici} index depends on both the reference genome and annotation to be indexed, as well as the length of the reads that will be mapped against this index.  
#' The read length is used to determine an appropriate flanking size (default of read length - 5) to add to the ends of the extracted collapsed intron sequences.  
#' The read length used for construction need not exactly match those being mapped, and we have evaluated that the index is reasonably robust to similar read lengths and degrades gracefully as the indexed and mapped read lengths diverge.  
#' Further, the short-read sequencing by synthesis most commonly employed for single-cell and single-nucleus experiments comprises a small set of characteristic read lengths.  
#' Nonetheless, by simply constructing the index with a single flanking length that is as large as the longest reads to be considered, and by propagating the relative splice junction annotations to the mapping step, it should be possible for the index to be constructed independently of the read length parameter.  
#' We are currently exploring the optimal implementation of this enhancement.
#' 
#' @references
#' Srivastava et al. (2020).
#' Alignment and mapping methodology influence transcript abundance estimation
#' \url{https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02151-8}
#' 
#' He et al. (2021)
#' Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data
#' \url{https://www.nature.com/articles/s41592-022-01408-3}
#' 
#'
#'
#' @export
#'
#' @return Nothing is returned, but the necessary files for the splici
#'   reference are written to the designated \code{output_dir}.
#'
#' @examples
#' 
#' library(roe)
#' 
#' genome_path <- system.file("extdata/small_example_genome.fa", package = "roe")
#' gtf_path <- system.file("extdata/small_example.gtf", package = "roe")
#' extra_spliced = system.file("extdata/extra_spliced.txt", package = "roe")
#' extra_unspliced = system.file("extdata/extra_unspliced.txt", package = "roe")
#' output_dir = tempdir()
#' read_length=5
#' flank_trim_length = 2
#' filename_prefix = "splici"
#' 
#' # run the function
#' make_splici_txome(genome_path = genome_path,
#'                   gtf_path = gtf_path,
#'                   read_length = read_length,
#'                   output_dir = output_dir,
#'                   flank_trim_length = flank_trim_length,
#'                   filename_prefix = filename_prefix,
#'                   extra_spliced = extra_spliced,
#'                   extra_unspliced = extra_unspliced,
#'                   dedup_seqs = TRUE,
#'                   no_flanking_merge = FALSE
#' ) 
#' 
#' # grep the output filenames
#' grep("transcriptome_splici_fl", dir(output_dir), value = TRUE)
#' 


make_splici_txome <- function(genome_path,
                              gtf_path,
                              read_length,
                              output_dir,
                              flank_trim_length = 5,
                              filename_prefix = "splici",
                              extra_spliced = NULL,
                              extra_unspliced = NULL,
                              dedup_seqs = FALSE,
                              no_flanking_merge = FALSE,
                              quiet = FALSE
                              ) {

  suppressWarnings(.make_splici_txome(genome_path = genome_path,
                                      gtf_path = gtf_path,
                                      read_length = read_length,
                                      flank_trim_length = flank_trim_length,
                                      output_dir = output_dir,
                                      filename_prefix = filename_prefix,
                                      extra_spliced = extra_spliced,
                                      extra_unspliced = extra_unspliced,
                                      dedup_seqs = dedup_seqs,
                                      no_flanking_merge = no_flanking_merge,
                                      quiet = quiet
                                      )
  )
}

#' @importFrom eisaR getFeatureRanges getTx2Gene
#' @importFrom Biostrings readDNAStringSet writeXStringSet
#' @import BSgenome
#' @importFrom stringr str_detect word
#' @importFrom GenomicFeatures makeTxDbFromGFF genes extractTranscriptSeqs
#' @importFrom BiocGenerics unlist relist start end
#' @importFrom GenomicRanges reduce trim
#' @importFrom S4Vectors mcols metadata split
#' @importFrom GenomeInfoDb seqlevels seqlengths
#' @importFrom Biobase rowMin rowMax
#' @importFrom rlang .data
#' @importFrom utils write.table

.make_splici_txome <- function(genome_path,
                                gtf_path,
                                read_length,
                                output_dir,
                                flank_trim_length = 5,
                                filename_prefix = "splici",
                                extra_spliced = NULL,
                                extra_unspliced = NULL,
                                dedup_seqs = FALSE,
                                no_flanking_merge = FALSE,
                                quiet = FALSE
                                ) {
  ## TODO: Add this sentence somewhere in the documentation:
  # flank trim length is used to avoid marginal case when
  # dealing with junctional reads

  ############################################################################
  # Preprocessing
  ############################################################################

  .say(quiet, "- Locating required files...")
  if (!dir.exists(output_dir)) {
    dir.create(file.path(output_dir), recursive = TRUE,
                showWarnings = FALSE)
  }
  # make sure flank_length makes sense
  flank_length <- read_length - flank_trim_length
  if (flank_length < 0) {
    stop("The flank trim length must be smaller than the read length!")
  }
  # make sure gtf file exists
  if (!file.exists(gtf_path)) {
    stop("The following file does not exist: \n", gtf_path)
  }
  # make sure fasta file exists
  if (!file.exists(genome_path)) {
    stop("The following file does not exist: \n", genome_path)
  }

  # output file names
  filename_prefix <- paste0(filename_prefix, "_fl", flank_length)
  out_fa <- file.path(output_dir, paste0(filename_prefix, ".fa"))
  # out_t2g <- file.path(output_dir, 
  #                      paste0(filename_prefix, "_t2g.tsv"))
  out_t2g3col <- file.path(output_dir, paste0(filename_prefix,
                                              "_t2g_3col.tsv"))
  out_dup <- file.path(output_dir, "duplicate_entries.tsv")
  out_name2id <- file.path(output_dir, paste0("gene_id_to_name.tsv"))
  .say(quiet, "- Processing spliced transcripts and introns...")

  .say(quiet, "  - Loading the input files")
  # load the genome sequence
  x <- Biostrings::readDNAStringSet(file.path(genome_path))
  # get the first word as the name
  names(x) <- stringr::word(names(x), 1)

  ############################################################################
  # Process gtf to get spliced transcripts and introns
  ############################################################################

  suppressMessages({
    suppressWarnings({
      grl <- eisaR::getFeatureRanges(
        gtf = file.path(gtf_path),
        featureType = c("spliced", "intron"),
        intronType = "separate",
        flankLength = 0,
        joinOverlappingIntrons = TRUE,
        verbose = FALSE
    )})
  })

  ############################################################################
  # Get spliced related stuff
  ############################################################################

  .say(quiet, "  - Processing spliced transcripts")
  spliced_idx <- names(grl) %in% S4Vectors::metadata(grl)$featurelist$spliced
  spliced_grl <- grl[spliced_idx]

  spliced_t2g_df <- eisaR::getTx2Gene(spliced_grl, filepath = out_name2id)
  spliced_t2g_3col_df <- spliced_t2g_df
  spliced_t2g_3col_df$splice_status <- "S"

  ############################################################################
  # Get reduced introns
  ############################################################################
  .say(quiet, "  - Processing introns")

  # identify all introns and convert to GRanges
  intron_idx <- names(grl) %in% S4Vectors::metadata(grl)$featurelist$intron
  intron_gr <- BiocGenerics::unlist(grl[intron_idx])
  # pre-flanking merge
  if (no_flanking_merge) {
    intron_gr <- .add_metadata(intron_gr, x = x)
  }

  # add flanking length to each side
  intron_gr_flanked <- intron_gr + flank_length

  if (!no_flanking_merge) {
    intron_gr_flanked <- .add_metadata(intron_gr_flanked, x =x)
  }

  intron_gr_flanked$gene_id <- stringr::str_extract(intron_gr_flanked$gene_id, "^.*(?=\\-)")
  # remake intron GRangesList
  intron_grl <- BiocGenerics::relist(intron_gr_flanked,
                    lapply(structure(seq_along(intron_gr_flanked),
                                    names = intron_gr_flanked$transcript_id
                          ),
                          function(i) i
                    )
                )

  # make sure introns don't out of boundary
  suppressWarnings({
    GenomeInfoDb::seqlevels(intron_grl) <- GenomeInfoDb::seqlevels(x)
    GenomeInfoDb::seqlengths(intron_grl) <- GenomeInfoDb::seqlengths(x)
    intron_grl <- GenomicRanges::trim(intron_grl)
  })

  unspliced_t2g_df <- eisaR::getTx2Gene(intron_grl)
  unspliced_t2g_3col_df <- unspliced_t2g_df
  unspliced_t2g_3col_df$splice_status <- "U"
  # unspliced_t2g_3col_df$gene_id <- stringr::str_extract(unspliced_t2g_3col_df$gene_id, "^.*(?=\\-)")

  ############################################################################
  # extract sequences from genome
  ############################################################################
  .say(quiet, "  - Extracting sequences from the genome")

  grl <- c(spliced_grl, intron_grl)

  seqs <- GenomicFeatures::extractTranscriptSeqs(
    x = x,
    transcripts = grl
  )

  # If having duplicated sequences, only keep one
  if (dedup_seqs) {
    seqs <- .dedup_sequences(seqs, out_dup = out_dup)
    grl <- grl[names(seqs)]
  }

  # save some space
  rm(x)

  ############################################################################
  # process final outputs
  ############################################################################
  .say(quiet, "- Writing outputs")

  # df <- rbind(spliced_t2g_df, unspliced_t2g_df)
  # utils::write.table(df, out_t2g, sep = "\t", row.names = FALSE,
  #                    quote = FALSE, col.names = FALSE)
  .say(quiet, "  - Writing spliced and intron sequences")

  df_3col <- rbind(spliced_t2g_3col_df, unspliced_t2g_3col_df)
  Biostrings::writeXStringSet(seqs, out_fa, format = "fasta")
  utils::write.table(df_3col, out_t2g3col
                      , sep = "\t",
                      row.names = FALSE, quote = FALSE, col.names = FALSE)

  # optional: adding extra spliced and unspliced sequences from an fasta file
  if (!is.null(extra_spliced)) {
    .say(quiet, "  - Appending extra spliced sequences")

    if (!file.exists(extra_spliced)) {
      warning("  - Provided extra_sequences file does not exist, ignored.")
    } else {
      fa <- file(extra_spliced, open = "r")
      lns <- readLines(fa)
      close(fa)
      for (ln in lns) {
        if (startsWith(ln, ">")) {
          # it is a header, write to t2g files and fasta file
          txp_name <- gsub(">", "", ln)
          # utils::write.table(matrix(c(txp_name, txp_name), nrow = 1),
          #                    file = out_t2g, sep = "\t",
          #                    row.names = FALSE, quote = FALSE,
          #                    col.names = FALSE, append = TRUE)
          utils::write.table(matrix(c(txp_name, txp_name, "S"),
                                    nrow = 1),
                              file = out_t2g3col, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
          utils::write.table(ln, file = out_fa, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
          } else {
          # if not a header, just write to fasta file
          utils::write.table(ln, file = out_fa, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
        }
      }
    }
  }

  if (!is.null(extra_unspliced)) {
    .say(quiet, "  - Appending extra unspliced sequences")

    if (!file.exists(extra_unspliced)) {
      warning("  - Provided extra_sequences file does not exist, ignored.")
    } else {
      fa <- file(extra_unspliced, open="r")
      lns <- readLines(fa)
      close(fa)
      for (ln in lns) {
        if (startsWith(ln, ">")) {
          # it is a header, write to t2g file and fasta file
          txp_name <- gsub(">", "", ln)
          # utils::write.table(matrix(c(txp_name,
          #                             paste0(txp_name, "-U")),
          #                           nrow = 1), file = out_t2g,
          #                    sep = "\t",
          #                    row.names = FALSE, quote = FALSE,
          #                    col.names = FALSE, append = TRUE)
          utils::write.table(matrix(c(txp_name, txp_name, "U"),
                                    nrow = 1),
                              file = out_t2g3col, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
          utils::write.table(ln, file = out_fa, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
        } else {
          # if not a header, just write to fasta file
          utils::write.table(ln, file = out_fa, sep = "\t",
                              row.names = FALSE, quote = FALSE,
                              col.names = FALSE, append = TRUE)
        }
      }
    }
  }
  .say(quiet, "Done")
}

# This function takes a GRanage object and its genome, then returns
.add_metadata <- function(intron_gr, x) {
  # group introns by gene, then collapse overlapping ranges
  intron_grl <- GenomicRanges::reduce(S4Vectors::split(intron_gr,
                                                        intron_gr$gene_id))
  intron_gr <- BiocGenerics::unlist(intron_grl)

  # clean txp names and gene names
  intron_gr$exon_rank <- 1L
  intron_gr$type <- "exon"
  ## TODO: Revisit this
  intron_gr$transcript_id <- stringr::word(names(intron_gr), 1, sep = '-')
  intron_gr$gene_id <- intron_gr$transcript_id
  intron_gr$transcript_id <- make.unique(paste0(intron_gr$transcript_id,
                                                "-I"),
                                        sep = "")
  intron_gr$gene_id <- paste0(intron_gr$gene_id, "-I")
  intron_gr$exon_id <- intron_gr$transcript_id
  ## --
  names(intron_gr) <- NULL
  S4Vectors::mcols(intron_gr) <-
    S4Vectors::mcols(intron_gr)[, c("exon_id", "exon_rank",
                                    "transcript_id", "gene_id", "type")]


  # make sure introns don't out of boundary
  suppressWarnings({
    GenomeInfoDb::seqlevels(intron_gr) <- GenomeInfoDb::seqlevels(x)
    GenomeInfoDb::seqlengths(intron_gr) <- GenomeInfoDb::seqlengths(x)
    intron_gr <- GenomicRanges::trim(intron_gr)
  })

  intron_gr
}

.dedup_sequences <- function(seqs, out_dup) {
  # sort seqs based on names
  # so that unique() will keep the
  # one with the smallest lex order
  seqs <- seqs[sort(names(seqs))]

  # save some work, only build list for duplicated items
  # get all non-unique seqs
  # the representative is not included
  duplicated_seqs <- seqs[duplicated(seqs)]

  # this is what we will return
  unique_seqs <- unique(seqs)
  rm(seqs)

  # we find all the names for each duplicated seq
  record_representatives <- list()
  while (length(duplicated_seqs) > 0) {
    # find the the elements that have
    # the same seq with the first element
    dup_idx <- which(duplicated_seqs == duplicated_seqs[1])
    dup_names <- names(duplicated_seqs)[dup_idx]

    # find the representative seq in the unqie_seqs
    repr_name <- names(unique_seqs)[unique_seqs == duplicated_seqs[1]]

    # record them
    record_representatives[[repr_name]] <- dup_names

    # remove them from duplicated_seqs
    duplicated_seqs <- duplicated_seqs[-dup_idx]
  }

  # dump dropped names as salmon
  out_df <- data.frame(RetainedRef = rep(names(record_representatives),
                                          sapply(record_representatives,
                                                length)
                                      ),
                        DuplicateRef = unlist(record_representatives)
            )
  write.table(out_df, file = out_dup, quote = FALSE,
            sep = "\t", row.names = FALSE, col.names = TRUE)

  # return unique seqs
  return(unique_seqs)
}