get_velocity_files: Get files required for RNA velocity with bustools
In lambdamoses/BUStoolsR: kallisto | bustools R utilities

get_velocity_files

R Documentation

Get files required for RNA velocity with bustools

Description

Computation of RNA velocity requires the number of unspliced transcripts, which can be quantified with reads containing intronic sequences. This function extracts intronic sequences flanked by L-1 bases of exonic sequences where L is the biological read length of the single cell technology of interest. The flanking exonic sequences are included for reads partially mapping to an intron and an exon.

Usage

get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  ...
)

## S4 method for signature 'GRanges'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'character'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  is_circular = NULL,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'TxDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  chrs_only = TRUE
)

## S4 method for signature 'EnsDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE
)

Arguments

`X`	Gene annotation with transcript and exon information. It can be a path to a GTF file with annotation of exon coordinates of transcripts, preferably from Ensembl. In the metadata, the following fields are required: type (e.g. whether the range of interest is a gene or transcript or exon or CDS), gene ID, and transcript ID. These fields need not to have standard names, as long as their names are specified in arguments of this function. It can also be a `TxDb` object, such as from the Bioconductor package `TxDb.Hsapiens.UCSC.hg38.knownGene`. It can also be a `EnsDb` object.
`L`	Length of the biological read. For instance, 10xv1: 98 nt, 10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length in a fastq file for biological reads with the `bash` commands: If the fastq file is gzipped, then do `⁠zcat your_file.fastq.gz \| head⁠` on Linux. If on Mac, then `zcat < your_file.fastq.gz \| head`. Then you will see lines with nucleotide bases. Copy one of those lines and determine its length with `str_length` in R or `⁠echo -n <the sequence> \| wc -c⁠` in `bash`. Which file corresponds to biological reads depends on the particular technology.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`Transcriptome`	A `XStringSet`, a path to a fasta file (can be gzipped) of the transcriptome which contains sequences of spliced transcripts, or `NULL`. The transcriptome here will be concatenated with the intronic sequences to give one fasta file. When `NULL`, the transriptome sequences will be extracted from the genome given the gene annotation, so it will be guaranteed that transcript IDs in the transcriptome and in the annotation match. Otherwise, the type of transcript ID in the transcriptome must match that in the gene annotation supplied via argument `X`.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`style`	Formatting of chromosome names. Use `genomeStyles` to check which styles are supported for your organism of interest and what those styles look like. This can also be a style supported for your organism different from the style used by the annotation and the genome. Then this style will be used for both the annotation and the genome. Can take the following values: genome If style of the annnotation is different from that of the genome, then the style of the genome will be used. other Custom style, need to manually ensure that the style in annotation matches that of the genome. Ensembl Or `UCSC` or `NCBI`, whichever is supported by your species of interest.
`isoform_action`	Character, indicating action to take with different transcripts of the same gene. Must be one of the following: collapse First, the union of all exons of different transcripts of a gene will be taken. Then the introns will be inferred from this union. Only the flanked intronic sequences are affected; isoforms will always be taken into account for spliced sequences or exon-exon junctions. separate Introns from different transcripts will be kept separate.
`exon_option`	Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following: full The full cDNA sequences, which include the full exonic sequences, will be used. This is the default. junction Only the exon-exon junctions, with L-1 bases on each side of the junctions, will be used.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`width`	Maximum number of letters per line of sequence in the output fasta file. Must be an integer.
`...`	Extra arguments for methods.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`save_filtered_gtf`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GTF file.
`is_circular`	Logical vector of the same length as the number of sequences in the annotation and with the same names as the sequences, indicating whether the sequence is circular. If `NULL`, then all sequences will be assumed to be linear.
`use_transcript_version`	Logical, whether to include version number in the Ensembl transcript ID.
`use_gene_version`	Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.

Value

The following files will be written to disk in the directory out_path:

cDNA_introns.fa: A fasta file containing both the spliced transcripts and the flanked intronic sequences. The intronic sequences are flanked by L-1 nt of exonic sequences to capture reads from nascent transcript partially mapping to exons. If the exon is shorter than 2*(L-1) nt, then the entire exon will be included in the intronic sequence. This will be used to build the kallisto index.
cDNA_tx_to_capture.txt: A text file of transcript IDs of spliced transcripts. If exon_option == "junction", then IDs of the exon-exon junctions. These IDs will have the pattern "transcript ID"-Jx, where x is a number differentiating between different junctions of the same transcript. Here x will always be ordered from 5' to 3' as on the plus strand.
introns_tx_to_capture.txt: A text file of IDs of introns. The names will have the pattern "transcript ID"-Ix, where x is a number differentiating between introns of the same transcript. If all transcripts of the same gene are collapsed before inferring intronic sequences, gene ID will be used in place of transcript ID. Here x will always be ordered from 5' to 3' as on the plus strand.
tr2g.txt: A text file with two columns matching transcripts and introns to genes. The first column is transcript or intron ID, and the second column is the corresponding gene ID. The part for transcripts are generated from the gene annotation supplied.

Nothing is returned into the R session.

Examples

# Use toy example
toy_path <- system.file("testdata", package = "BUSpaRse")
file <- paste0(toy_path, "/velocity_annot.gtf")
genome <- Biostrings::readDNAStringSet(paste0(toy_path, "/velocity_genome.fa"))
transcriptome <- paste0(toy_path, "/velocity_tx.fa")
get_velocity_files(file, 11, genome, transcriptome, ".",
  gene_version = NULL, transcript_version = NULL)
# Clean up output of the example
file.remove("cDNA_introns.fa", "cDNA_tx_to_capture.txt",
            "introns_tx_to_capture.txt", "tr2g.txt")

lambdamoses/BUStoolsR documentation built on Aug. 1, 2024, 6:30 a.m.