align: Align Sequence Reads to a Reference Genome via Seed-and-Vote
In Rsubread: Mapping, quantification and variant analysis of sequencing data

Description Usage Arguments Details Value Author(s) References See Also Examples

The align function can align both DNA and RNA sequencing reads. Subjunc is an RNA-seq aligner and it reports full alignment of each read (align reports partial alignment for exon spanning reads).

align(

    # index for reference sequences
    index,
    
    # input reads and output
    readfile1,
    readfile2 = NULL,
    type = "rna",
    input_format = "gzFASTQ",
    output_format = "BAM",
    output_file = paste(readfile1,"subread",output_format,sep="."),
    
    # offset value added to Phred quality scores of read bases
    phredOffset = 33,
    
    # thresholds for mapping
    nsubreads = 10,
    TH1 = 3,
    TH2 = 1,
    maxMismatches = 3,
    
    # unique mapping and multi-mapping
    unique = FALSE,
    nBestLocations = 1,
    
    # indel detection
    indels = 5,
    complexIndels = FALSE,
    
    # read trimming
    nTrim5 = 0,
    nTrim3 = 0,
    
    # distance and orientation of paired end reads
    minFragLength = 50,
    maxFragLength = 600,
    PE_orientation = "fr",
    
    # number of CPU threads
    nthreads = 1,
    
    # read group
    readGroupID = NULL,
    readGroup = NULL,
    
    # read order
    keepReadOrder = FALSE,
    sortReadsByCoordinates = FALSE,
    
    # color space reads
    color2base = FALSE,
    
    # dynamic programming
    DP_GapOpenPenalty = -1,
    DP_GapExtPenalty = 0,
    DP_MismatchPenalty = 0,
    DP_MatchScore = 2,
    
    # detect structural variants
    detectSV = FALSE,
    
    # gene annotation
    useAnnotation = FALSE,
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTF = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    chrAliases = NULL)
    
subjunc(

    # index for reference sequences
    index,
    
    # input reads and output
    readfile1,
    readfile2 = NULL,
    input_format = "gzFASTQ",
    output_format = "BAM",
    output_file = paste(readfile1,"subjunc",output_format,sep = "."),
    
    # offset value added to Phred quality scores of read bases
    phredOffset = 33,
    
    # thresholds for mapping
    nsubreads = 14,
    TH1 = 1,
    TH2 = 1,
    maxMismatches = 3,
    
    # unique mapping and multi-mapping
    unique = FALSE,
    nBestLocations = 1,
    
    # indel detection
    indels = 5,
    complexIndels = FALSE,
    
    # read trimming
    nTrim5 = 0,
    nTrim3 = 0,
    
    # distance and orientation of paired end reads
    minFragLength = 50,
    maxFragLength = 600,
    PE_orientation = "fr",
    
    # number of CPU threads
    nthreads = 1,
    
    # read group
    readGroupID = NULL,
    readGroup = NULL,
    
    # read order
    keepReadOrder = FALSE,
    sortReadsByCoordinates = FALSE,
    
    # color space reads
    color2base = FALSE,
    
    # dynamic programming
    DP_GapOpenPenalty = -1,
    DP_GapExtPenalty = 0,
    DP_MismatchPenalty = 0,
    DP_MatchScore = 2,
    
    # detect all junctions including gene fusions
    reportAllJunctions = FALSE,
    
    # gene annotation
    useAnnotation = FALSE,
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTF = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    chrAliases = NULL)

`index`	character string giving the basename of index file. Index files should be located in the current directory.
`readfile1`	a character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. See `input_format` option for more supported formats.
`readfile2`	a character vector giving names of files that include second reads in paired-end read data. Files included in `readfile2` should be in the same order as their mate files included in `readfile1` .`NULL` by default.
`type`	a character string or an integer giving the type of sequencing data. Possible values include `rna` (or `0`; RNA-seq data) and `dna` (or `1`; genomic DNA-seq data such as WGS, WES, ChIP-seq data etc.). Character strings are case insensitive.
`input_format`	character string specifying format of read input files. `gzFASTQ` by default (this also includes FASTQ, FASTA, gzipped FASTQ and gzipped FASTA formats). Other supported formats include `SAM` and `BAM`. Character values are case insensitive.
`output_format`	character string specifying format of output file. `BAM` by default. Acceptable formats include `SAM` and `BAM`.
`output_file`	a character vector specifying names of output files. By default, names of output files are set as the file names provided in `readfile1` added with an suffix string.
`phredOffset`	numeric value added to base-calling Phred scores to make quality scores (represented as ASCII letters). Possible values include `33` and `64`. By default, `33` is used.
`nsubreads`	numeric value giving the number of subreads extracted from each read.
`TH1`	numeric value giving the consensus threshold for reporting a hit. This is the threshold for the first reads if paired-end read data are provided.
`TH2`	numeric value giving the consensus threhold for the second reads in paired-end data.
`maxMismatches`	numeric value giving the maximum number of mis-matched bases allowed in the alignment. `3` by default. Mis-matches found in soft-clipped bases are not counted.
`unique`	logical indicating if only uniquely mapped reads should be reported. A uniquely mapped read has one single mapping location that has less mis-matched bases than any other candidate locations. By default, multi-mapping reads will be reported in addition to uniquely mapped reads. Number of alignments reported for each multi-mapping read is determined by the `nBestLocations` parameter.
`nBestLocations`	numeric value specifying the maximal number of equally-best mapping locations that will be reported for a multi-mapping read. 1 by default. The allowed value is between 1 to 16 (inclusive). In the mapping output, ‘NH’ tag is used to indicate how many alignments are reported for the read and ‘HI’ tag is used for numbering the alignments reported for the same read. This argument is only applicable when `unique` option is set to `FALSE`.
`indels`	numeric value giving the maximum number of insertions/deletions allowed during the mapping. `5` by default.
`complexIndels`	logical indicating if complex indels will be detected. If `TRUE`, the program will try to detect multiple short indels that occurs concurrently in a small genomic region (indels could be as close as 1bp apart).
`nTrim5`	numeric value giving the number of bases trimmed off from 5' end of each read. `0` by default.
`nTrim3`	numeric value giving the number of bases trimmed off from 3' end of each read. `0` by default.
`minFragLength`	numeric value giving the minimum fragment length. 50 by default.
`maxFragLength`	numeric value giving the maximum fragment length. 600 by default.
`PE_orientation`	character string giving the orientation of the two reads from the same pair. It has three possible values including `fr`, `ff` and `rf`. Letter `f` denotes the forward strand and letter `r` the reverse strand. `fr` by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).
`nthreads`	numeric value giving the number of threads used for mapping. `1` by default.
`readGroupID`	a character string giving the read group ID. The specified string is added to the read group header field and also be added to each read in the mapping output. `NULL` by default.
`readGroup`	a character string in the format of `tag:value`. This string will be added to the read group (RG) header in the mapping output. `NULL` by default.
`keepReadOrder`	logical indicating if the order of reads in the BAM output is kept the same as that in the input file. (Reads from the same pair are always placed next to each other regardless of this option).
`sortReadsByCoordinates`	logical indicating if reads will be sorted by their mapping locations in the mapping output. This option is applicable for BAM output only. A BAI index file will also be generated for each BAM file so the generated BAM files can be directly loaded into a genome browser such as IGB or IGV. `FALSE` by default.
`color2base`	logical. If `TRUE`, color-space read bases will be converted to base-space bases in the mapping output. Note that the mapping itself will still be performed at color-space. `FALSE` by default.
`DP_GapOpenPenalty`	a numeric value giving the penalty for opening a gap when using the Smith-Waterman dynamic programming algorithm to detect insertions and deletions. The Smith-Waterman algorithm is only applied for those reads which are found to contain insertions or deletions. `-1` by default.
`DP_GapExtPenalty`	a numeric value giving the penalty for extending the gap, used by the Smith-Waterman algorithm. `0` by default.
`DP_MismatchPenalty`	a numeric value giving the penalty for mismatches, used by the Smith-Waterman algorithm. `0` by default.
`DP_MatchScore`	a numeric value giving the score for matches used by the Smith-Waterman algorithm. `2` by default.
`detectSV`	logical indicating if structural variants (SVs) will be detected during read mapping. See below for more details.
`reportAllJunctions`	logical indicating if all discovered junctions will be reported. This includes exon-exon junctions and also gene fusions. Presence of donor/receptor sites is not required when `reportAllJunctions` is `TRUE`. This option should only be used for RNA-seq data.
`useAnnotation`	logical indicating if gene annotation information will be used in the mapping. `FALSE` by default.
`annot.inbuilt`	a character string specifying an in-built annotation used for read summarization. It has four possible values including `"mm10"`, `"mm9"`, `"hg38"` and `"hg19"`, corresponding to the NCBI RefSeq annotations for genomes ‘mm10’, ‘mm9’, ‘hg38’ and ‘hg19’, respectively. `"mm10"` by default. See `featureCounts` function for more details on the in-built annotations.
`annot.ext`	A character string giving name of a user-provided annotation file or a data frame including user-provided annotation data. If the annotation is in GTF format, it can only be provided as a file. If it is in SAF format, it can be provided as a file or a data frame. The provided annotation file can be a uncompressed or gzip compressed file. If an annotation is provided via `annot.ext`, the `annot.inbuilt` parameter will be ignored. See `featureCounts` function for more details about this parameter.
`isGTF`	logical indicating if the annotation provided via the `annot.ext` argument is in GTF format or not. `FALSE` by default. This option is only applicable when `annot.ext` is not `NULL`.
`GTF.featureType`	a character string denoting the type of features that will be extracted from a GTF annotation. `"exon"` by default. This argument is only applicable when `isGTF` is `TRUE`.
`GTF.attrType`	a character string denoting the type of attributes in a GTF annotation that will be used to group features. `"gene_id"` by default. The grouped features are called meta-features. For instance, a feature can be an exon and several exons can be grouped into a gene (meta-feature). This argument is only applicable when `isGTF` is `TRUE`.
`chrAliases`	a character string providing the name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome.

align and subjunc are R versions of programs in the Subread suite of programs (http://subread.sourceforge.net) that implement fast, accurate sequence read alignment based on the "seed-and-vote" mapping paradigm (Liao et al. 2013; Liao et al. 2019). align is the R version of the command-line program subread-align and subjunc is the R version of the command-line program subjunc.

align is a general-purpose aligner that can be used to align both genomic DNA-seq reads and RNA-seq reads. subjunc is designed for mapping RNA-seq reads only. The main difference between subjunc and align for RNA-seq alignment is that subjunc reports discovered exon-exon junctions and it also performs full alignments for every read including exon-spanning reads. align instead reports the largest mappable regions for each read and soft-clips the remainder of the read. subjunc requires the presence of donor and receptor sites when calling exon-exon junctions. It can detect up to four junction locations in a junction read.

align and subjunc can be run with either a full index or a gapped index (see buildindex for more details). Maximum mapping speed is achieved when the full index is used. With a full index built for human/mouse genome, align maps ~14 million reads per minute with 10 threads. subjunc is slightly slower than align. Both align and subjunc require 17.8GB of memory for mapping. With a gapped index built for human/mouse genome, align maps ~9 million reads per minute with 10 threads. align requires 8.2GB of memory for mapping and subjunc requires 8.8GB of memory.

The detectSV argument of align can be used to detect structural variants (SVs) in genomic DNA sequencing data. The reportAllJunctions argument of subjunc function can be used for the detection of SVs in RNA-seq data, as well as the detection of non-canonical junctions that do not have the canonical donor/receptor sites. For each library, breakpoints detected from SV events are saved to a file with suffix name ‘.breakpoints.txt’, which includes chromosomal coordinates of SV breakpoints and numbers of supporting reads. The BAM/SAM output includes extra fields to describe the complete alignments of breakpoint-containing reads. For a breakpoint-containing read, mapping of its major sequence segment is described in the main fields of BAM/SAM output whereas mapping of its minor sequence segment, which does not map along with the major segment due to the presence of a breakpoint, is described in the extra fields including 'CC'(Chr), 'CP'(Position),'CG'(CIGAR) and 'CT'(strand). Note that each breakpoint-containing read occupies only one row in the BAM/SAM output.

For each library, mapping results including both aligned and unaligned reads are saved to a file. Indels discovered during mapping are saved to a VCF format file (.indel.vcf). subjunc also outputs a BED format file that contains list of discovered exon-exon junctions (.junction.bed). If detectSV or reportAllJunctions options are set to TRUE, structural variants discovered during read mapping will be reported and saved to a file (.breakpoints.txt), which includes chromosomal coordinates of breakpoints and number of supporting reads for each breakpoint.

Other than outputting data to files, align and subjunc also return a data.frame that includes mapping statistics for each library, such as total number of reads, number of mapped reads, number of uniquely mapped reads, number of indels and number of junctions (subjunc only).

Wei Shi and Yang Liao

Yang Liao, Gordon K Smyth and Wei Shi (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8):e47. http://www.ncbi.nlm.nih.gov/pubmed/30783653

Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108. http://www.ncbi.nlm.nih.gov/pubmed/23558742

buildindex

# Build an index for the artificial sequence included in file 'reference.fa'.
ref <- system.file("extdata","reference.fa",package="Rsubread")

buildindex(basename="./reference_index",reference=ref)

# align a sample read dataset ('reads.txt') to the sample reference
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")

align.stat <- align(index = "./reference_index", readfile1 = reads,
output_file = "./Rsubread_alignment.BAM", phredOffset = 64)

align.stat