align: Align sequence reads to a reference genome via seed-and-vote

Description Usage Arguments Details Value Author(s) References Examples

View source: R/align.R

Description

align function can align both DNA and RNA sequencing reads. Subjunc is an RNA-seq aligner and it reports full alignment of each read (align reports partial alignment for exon spanning reads).

Usage

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
align(
# index for reference sequences
index,

# input reads and output
readfile1,
readfile2=NULL,
type="rna",
input_format="gzFASTQ",
output_format="BAM",
output_file=paste(as.character(readfile1),"subread",output_format,sep="."),

# offset value added to Phred quality scores of read bases
phredOffset=33,

# thresholds for mapping
nsubreads=10,
TH1=3,
TH2=1,
maxMismatches=3,

# unique mapping and multi-mapping
unique=FALSE,
nBestLocations=1,

# indel detection
indels=5,
complexIndels=FALSE,

# read trimming
nTrim5=0,
nTrim3=0,

# distance and orientation of paired end reads
minFragLength=50,
maxFragLength=600,
PE_orientation="fr",

# number of CPU threads
nthreads=1,

# read group
readGroupID=NULL,
readGroup=NULL,

# read order
keepReadOrder=FALSE,
sortReadsByCoordinates=FALSE,

# color space reads
color2base=FALSE,

# dynamic programming
DP_GapOpenPenalty=-1,
DP_GapExtPenalty=0,
DP_MismatchPenalty=0,
DP_MatchScore=2,

# detect structural variants
detectSV=FALSE,

# gene annotation
useAnnotation=FALSE,
annot.inbuilt="mm10",
annot.ext=NULL,
isGTF=FALSE,
GTF.featureType="exon",
GTF.attrType="gene_id",
chrAliases=NULL)

subjunc(
# index for reference sequences
index,

# input reads and output
readfile1,
readfile2=NULL,
input_format="gzFASTQ",
output_format="BAM",
output_file=paste(as.character(readfile1),"subjunc",output_format,sep="."),

# offset value added to Phred quality scores of read bases
phredOffset=33,

# thresholds for mapping
nsubreads=14,
TH1=1,
TH2=1,
maxMismatches=3,

# unique mapping and multi-mapping
unique=FALSE,
nBestLocations=1,

# indel detection
indels=5,
complexIndels=FALSE,

# read trimming
nTrim5=0,
nTrim3=0,

# distance and orientation of paired end reads
minFragLength=50,
maxFragLength=600,
PE_orientation="fr",

# number of CPU threads
nthreads=1,

# read group
readGroupID=NULL,
readGroup=NULL,

# read order
keepReadOrder=FALSE,
sortReadsByCoordinates=FALSE,

# color space reads
color2base=FALSE,

# dynamic programming
DP_GapOpenPenalty=-1,
DP_GapExtPenalty=0,
DP_MismatchPenalty=0,
DP_MatchScore=2,

# detect all junctions including gene fusions
reportAllJunctions=FALSE,

# gene annotation
useAnnotation=FALSE,
annot.inbuilt="mm10",
annot.ext=NULL,
isGTF=FALSE,
GTF.featureType="exon",
GTF.attrType="gene_id",
chrAliases=NULL)

Arguments

index

character string giving the basename of index file. Index files should be located in the current directory.

readfile1

a character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. See input_format option for more supported formats.

readfile2

a character vector giving names of files that include second reads in paired-end read data. Files included in readfile2 should be in the same order as their mate files included in readfile1 .NULL by default.

type

a character string or an integer giving the type of sequencing data. Possible values include rna (or 0; RNA-seq data) and dna (or 1; genomic DNA-seq data such as WGS, WES, ChIP-seq data etc.). Character strings are case insensitive.

input_format

character string specifying format of read input files. gzFASTQ by default (this also includes FASTQ, FASTA and gzipped FASTA formats). Other supported formats include SAM and BAM. Character values are case insensitive.

output_format

character string specifying format of output file. BAM by default. Acceptable formats include SAM and BAM.

output_file

a character vector specifying names of output files. By default, names of output files are set as the file names provided in readfile1 added with an suffix string.

phredOffset

numeric value added to base-calling Phred scores to make quality scores (represented as ASCII letters). Possible values include 33 and 64. By default, 33 is used.

nsubreads

numeric value giving the number of subreads extracted from each read.

TH1

numeric value giving the consensus threshold for reporting a hit. This is the threshold for the first reads if paired-end read data are provided.

TH2

numeric value giving the consensus threhold for the second reads in paired-end data.

maxMismatches

numeric value giving the maximum number of mis-matched bases allowed in the alignment. 3 by default. Mis-matches found in soft-clipped bases are not counted.

unique

logical indicating if only uniquely mapped reads should be reported. A uniquely mapped read has one single mapping location that has less mis-matched bases than any other candidate locations. By default, multi-mapping reads will be reported in addition to uniquely mapped reads. Number of alignments reported for each multi-mapping read is determined by the nBestLocations parameter.

nBestLocations

numeric value specifying the maximal number of equally-best mapping locations that will be reported for a multi-mapping read. 1 by default. The allowed value is between 1 to 16 (inclusive). In the mapping output, ‘NH’ tag is used to indicate how many alignments are reported for the read and ‘HI’ tag is used for numbering the alignments reported for the same read. This argument is only applicable when unique option is set to FALSE.

indels

numeric value giving the maximum number of insertions/deletions allowed during the mapping. 5 by default.

complexIndels

logical indicating if complex indels will be detected. If TRUE, the program will try to detect multiple short indels that occurs concurrently in a small genomic region (indels could be as close as 1bp apart).

nTrim5

numeric value giving the number of bases trimmed off from 5' end of each read. 0 by default.

nTrim3

numeric value giving the number of bases trimmed off from 3' end of each read. 0 by default.

minFragLength

numeric value giving the minimum fragment length. 50 by default.

maxFragLength

numeric value giving the maximum fragment length. 600 by default.

PE_orientation

character string giving the orientation of the two reads from the same pair. It has three possible values including fr, ff and rf. Letter f denotes the forward strand and letter r the reverse strand. fr by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).

nthreads

numeric value giving the number of threads used for mapping. 1 by default.

readGroupID

a character string giving the read group ID. The specified string is added to the read group header field and also be added to each read in the mapping output. NULL by default.

readGroup

a character string in the format of tag:value. This string will be added to the read group (RG) header in the mapping output. NULL by default.

keepReadOrder

logical indicating if the order of reads in the BAM output is kept the same as that in the input file. Reads from the same pair are always placed next to each other no matter this option is specified or not.

sortReadsByCoordinates

logical indicating if reads will be sorted by their mapping locations in the mapping output. This option is applicable for BAM output only. A BAI index file will also be generated for each BAM file so the generated BAM files can be directly loaded into a genome browser such as IGB or IGV. FALSE by default.

color2base

logical. If TRUE, color-space read bases will be converted to base-space bases in the mapping output. Note that the mapping itself will still be performed at color-space. FALSE by default.

DP_GapOpenPenalty

a numeric value giving the penalty for opening a gap when using the Smith-Waterman dynamic programming algorithm to detect insertions and deletions. The Smith-Waterman algorithm is only applied for those reads which are found to contain insertions or deletions. -1 by default.

DP_GapExtPenalty

a numeric value giving the penalty for extending the gap, used by the Smith-Waterman algorithm. 0 by default.

DP_MismatchPenalty

a numeric value giving the penalty for mismatches, used by the Smith-Waterman algorithm. 0 by default.

DP_MatchScore

a numeric value giving the score for matches used by the Smith-Waterman algorithm. 2 by default.

detectSV

logical indicating if structural variants (SVs) will be detected during read mapping. See below for more details.

reportAllJunctions

logical indicating if all discovered junctions will be reported. This includes exon-exon junctions and also gene fusions. Presence of donor/receptor sites is not required when reportAllJunctions is TRUE. This option should only be used for RNA-seq data.

useAnnotation

logical indicating if gene annotation information will be used in the mapping. FALSE by default.

annot.inbuilt

a character string specifying an in-built annotation used for read summarization. It has four possible values including "mm10", "mm9", "hg38" and "hg19", corresponding to the NCBI RefSeq annotations for genomes ‘mm10’, ‘mm9’, ‘hg38’ and ‘hg19’, respectively. "mm10" by default. See featureCounts function for more details on the in-built annotations.

annot.ext

A character string giving name of a user-provided annotation file or a data frame including user-provided annotation data. If the annotation is in GTF format, it can only be provided as a file. If it is in SAF format, it can be provided as a file or a data frame. The provided annotation file can be a uncompressed or gzip compressed file. If an annotation is provided via annot.ext, the annot.inbuilt parameter will be ignored. See featureCounts function for more details about this parameter.

isGTF

logical indicating if the annotation provided via the annot.ext argument is in GTF format or not. FALSE by default. This option is only applicable when annot.ext is not NULL.

GTF.featureType

a character string denoting the type of features that will be extracted from a GTF annotation. "exon" by default. This argument is only applicable when isGTF is TRUE.

GTF.attrType

a character string denoting the type of attributes in a GTF annotation that will be used to group features. "gene_id" by default. The grouped features are called meta-features. For instance, a feature can be an exon and several exons can be grouped into a gene (meta-feature). This argument is only applicable when isGTF is TRUE.

chrAliases

a character string providing the name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome.

Details

The align function is an R wrapper for the Subread aligner (Liao et al., 2013) that uses a novel mapping paradigm called “seed-and-vote". Subread is general-purpose aligner that can be used to align both genomic DNA-seq reads and RNA-seq reads. The subjunc function is an R wrapper for the Subjunc aligner (Liao et al., 2013) that is designed for mapping RNA-seq reads only.

The main difference between subjunc and align is that subjunc reports discovered exon-exon junctions and it also performs full alignments for every read including exon-spanning reads. align reports the largest mappable regions for each read.

align and subjunc can be run with a full index or a gapped index (see buildindex for more details). Maximum mapping speed is achieved when a one-block full index is used for mapping.

subjunc requires the presence of donor and receptor sites when calling exon-exon junctions. It can detect up to four junction locations in a junction read.

The detectSV parameter in align function can be used to detect structural variants (SVs) in genomic DNA sequencing data. The reportAllJunctions parameter in subjunc function can be used for the detection of SVs in RNA-seq data, as well as the detection of non-canonical junctions that do not have the canonical donor/receptor sites. For each library, breakpoints detected from SV events are saved to a file with suffix name ‘.breakpoints.txt’, which includes chromosomal coordinates of SV breakpoints and numbers of supporting reads. The BAM/SAM output includes extra fields to describe the complete alignments of breakpoint-containing reads. For a breakpoint-containing read, mapping of its major sequence segment is described in the main fields of BAM/SAM ouptut whereas mapping of its minor sequence segment, which does not map along with the major segment due to the presence of a breakpoint, is described in the extra fields including ‘CC’(Chr), ‘CP’(Position),‘CG’(CIGAR) and ‘CT’(strand). Note that each breakpoint-containing read occupies only one row in BAM/SAM output.

Value

Return a data.frame object that includes mapping statistics for each library. Read mapping results are saved to BAM or SAM files in the current working directory. Short indels detected during mapping are saved to BED files. subjunc also outputs discovered exon-exon junctions to BED files (separate BED files).

Author(s)

Wei Shi and Yang Liao

References

Yang Liao, Gordon K Smyth and Wei Shi. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013.

Examples

1
2
3
4
5
6
7
8
# Build an index for the artificial sequence included in file 'reference.fa'.
library(Rsubread)
ref <- system.file("extdata","reference.fa",package="Rsubread")
buildindex(basename="./reference_index",reference=ref)

# align a sample read dataset ('reads.txt') to the sample reference
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align.stat <- align(index="./reference_index",readfile1=reads,output_file="./Rsubread_alignment.BAM",phredOffset=64)

Rsubread documentation built on Nov. 9, 2018, 6 p.m.