readAligned: (Legacy) Read aligned reads and their quality scores into R...
In ShortRead: FASTQ input and manipulation


Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see ‘details’ for exceptions) read all files into a single R object.

readAligned(dirPath, pattern=character(0), ...)

`dirPath`	A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute; some methods also accept a character vector of file names) of aligned read files to be input.
`pattern`	The (`grep`-style) pattern describing file names to be read. The default (`character(0)`) results in (attempted) input of all files in the directory.
`...`	Additional arguments, used by methods. When `dirPath` is a character vector, the argument `type` must be provided. Possible values for `type` and their meaning are described below. Most methods implement `filter=srFilter()`, allowing objects of class `SRFilter` to selectively return aligned reads.
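Before reading, it can help to check which files a `pattern` selects. The sketch below uses only base R and assumes that file selection behaves like `list.files()` with the same `pattern`; the directory and file names are hypothetical and created only for illustration.

```r
# Hypothetical file names in a temporary directory, for illustration only.
dirPath <- tempfile("aligned_")
dir.create(dirPath)
invisible(file.create(file.path(
  dirPath, c("s_1_export.txt", "s_2_export.txt", "notes.md")
)))

# Preview the files that a grep-style pattern selects
# (assumption: readAligned matches file names in a comparable way).
matched <- list.files(dirPath, pattern = "_export\\.txt$")
print(matched)
```

Anchoring the pattern with `$` and escaping the dot keeps it from also matching names such as `s_1_export.txt.bak`.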

There is no standard aligned read file format; methods parse particular file types.

The readAligned,character-method interprets file types based on an additional type argument. Supported types are:

type="SolexaExport"

This type parses .*_export.txt files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions. If parsed, values can be retrieved from AlignedRead as follows:

Machine: see below
Run number: stored in alignData
Lane: stored in alignData
Tile: stored in alignData
X: stored in alignData
Y: stored in alignData
Multiplex index: see below
Paired read number: see below
Read: sread
Quality: quality
Match chromosome: chromosome
Match contig: alignData
Match position: position
Match strand: strand
Match description: Ignored
Single-read alignment score: alignQuality
Paired-read alignment score: Ignored
Partner chromosome: Ignored
Partner contig: Ignored
Partner offset: Ignored
Partner strand: Ignored
Filtering: alignData

The following optional arguments, set to FALSE by default, influence data input:

withMultiplexIndex: When TRUE, include the multiplex index as a column multiplexIndex in alignData.
withPairedReadNumber: When TRUE, include the paired read number as a column pairedReadNumber in alignData.
withId: When TRUE, construct an identifier string as ‘Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber’. The substrings ‘#multiplexIndex’ and ‘/pairedReadNumber’ are not present if withMultiplexIndex=FALSE or withPairedReadNumber=FALSE.
withAll: A convenience which, when TRUE, sets all with* values to TRUE.
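The identifier layout described for withId can be illustrated with a small base R helper. This is a sketch of the string format only, not ShortRead's implementation; the function name `make_id` and all field values are hypothetical.

```r
# Hypothetical helper reproducing the documented identifier layout:
# Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber
make_id <- function(machine, run, lane, tile, x, y,
                    multiplexIndex = NULL, pairedReadNumber = NULL) {
  id <- sprintf("%s_%s:%s:%s:%s:%s", machine, run, lane, tile, x, y)
  # The optional substrings are appended only when a value is supplied,
  # mirroring withMultiplexIndex=TRUE / withPairedReadNumber=TRUE.
  if (!is.null(multiplexIndex)) id <- paste0(id, "#", multiplexIndex)
  if (!is.null(pairedReadNumber)) id <- paste0(id, "/", pairedReadNumber)
  id
}

# Made-up field values, for illustration only.
id_plain <- make_id("HWI-EAS88", 3, 2, 1, 451, 945)
id_full <- make_id("HWI-EAS88", 3, 2, 1, 451, 945,
                   multiplexIndex = 0, pairedReadNumber = 1)
print(id_plain)
print(id_full)
```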

Note that not all paired read columns are interpreted. Different interfaces to reading alignment files are described in SolexaPath and SolexaSet.

type="SolexaPrealign"

See SolexaRealign

type="SolexaAlign"

See SolexaRealign

type="SolexaRealign"

These types parse s_L_TTTT_prealign.txt, s_L_TTTT_align.txt or s_L_TTTT_realign.txt files produced by default and eland analyses. From the Solexa documentation, align corresponds to unfiltered first-pass alignments, prealign adjusts alignments for error rates (when available), realign filters alignments to exclude clusters failing to pass quality criteria.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

If parsed, values can be retrieved from AlignedRead as follows:

Sequence: stored in sread
Best score: stored in alignQuality
Number of hits: stored in alignData
Target position: stored in position
Strand: stored in strand
Target sequence: Ignored; parse using readXStringColumns
Next best score: stored in alignData

type="SolexaResult"

This parses s_L_eland_results.txt files, an intermediate format that does not contain read or alignment quality scores.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

Columns of this file type can be retrieved from AlignedRead as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008):

Id: Not parsed
Sequence: stored in sread
Type of match code: Stored in alignData as matchCode. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).
Number of exact matches: stored in alignData as nExactMatch
Number of 1-error mismatches: stored in alignData as nOneMismatch
Number of 2-error mismatches: stored in alignData as nTwoMismatch
Genome file of match: stored in chromosome
Position: stored in position
Strand: (direction of match) stored in strand
‘N’ treatment: stored in alignData, as NCharacterTreatment. ‘.’ indicates treatment of ‘N’ was not applicable; ‘D’ indicates treatment as deletion; ‘|’ indicates treatment as insertion
Substitution error: stored in alignData as mismatchDetailOne and mismatchDetailTwo. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (mismatchDetailOne), but two are present in files seen in the wild.
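The match codes and the substitution error notation above can be interpreted with a few lines of base R. This is a minimal sketch; the vector of codes is made up, and in practice such values would come from the matchCode and mismatchDetailOne columns of alignData.

```r
# Match codes and meanings as listed in the column description above.
match_code_meaning <- c(
  NM = "no match",
  QC = "no match due to quality control failure",
  RM = "no match due to repeat masking",
  U0 = "unique exact match",
  U1 = "unique match, 1 mismatch",
  U2 = "unique match, 2 mismatches",
  R0 = "multiple exact matches",
  R1 = "multiple 1-mismatch matches, no exact match",
  R2 = "multiple 2-mismatch matches, no exact or 1-mismatch match"
)

# Hypothetical matchCode values, for illustration only.
codes <- c("U0", "R1", "NM")
decoded <- unname(match_code_meaning[codes])
print(decoded)

# Split a mismatch detail such as "11A": 11 matching bases, then a
# mismatch at position 12 where the reference (not the read) has an A.
detail <- "11A"
n_matches <- as.integer(sub("[^0-9]+$", "", detail))
ref_base <- sub("^[0-9]+", "", detail)
mismatch_position <- n_matches + 1L
```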

type="MAQMap", records=-1L

Parse binary map files produced by MAQ. See details in the next section. The records option determines how many records are read; -1L (the default) means that all records are input. For type="MAQMap", dirPath and pattern must match a single file.

type="MAQMapShort", records=-1L

The same as type="MAQMap" but for map files made with Maq prior to version 0.7.0. (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.) For type="MAQMapShort", dirPath and pattern must match a single file.

type="MAQMapview"

Parse alignment files created by MAQ's ‘mapview’ command. Interpretation of columns is based on the description in the MAQ manual, specifically

        ...each line consists of read name, chromosome, position,
        strand, insert size from the outer coordinates of a pair,
        paired flag, mapping quality, single-end mapping quality,
        alternative mapping quality, number of mismatches of the
        best hit, sum of qualities of mismatched bases of the best
        hit, number of 0-mismatch hits of the first 24bp, number
        of 1-mismatch hits of the first 24bp on the reference,
        length of the read, read sequence and its quality.

The read name, read sequence, and quality are read as XStringSet objects. Chromosome and strand are read as factors. Position and mapping quality are numeric. These fields are mapped to their corresponding representation in AlignedRead objects.

Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp are represented in the AlignedRead object as components of alignData.

Remaining fields are currently ignored.
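To make the column order quoted from the MAQ manual concrete, the following base R sketch splits one record into named fields. It assumes the fields are tab-separated; the field names are hypothetical labels chosen for this example (not necessarily the column names used in alignData), and the record itself is made up.

```r
# Hypothetical labels for the 16 fields, in the order quoted above.
fields <- c("readName", "chromosome", "position", "strand", "insertSize",
            "pairedFlag", "mapQ", "singleEndMapQ", "altMapQ",
            "nMismatchBestHit", "mismatchQualitySum", "nExactHits24",
            "nOneMismatchHits24", "readLength", "sequence", "quality")

# A made-up record, assumed to be tab-separated.
line <- paste(c("read_1", "chr1", "1042", "+", "0", "0", "30", "30", "0",
                "1", "25", "1", "0", "8", "ACGTACGT", "IIIIIIII"),
              collapse = "\t")

# Split on tabs and attach the field labels.
rec <- setNames(strsplit(line, "\t", fixed = TRUE)[[1]], fields)

# Convert the fields that the text describes as numeric.
position <- as.integer(rec[["position"]])
mapQ <- as.integer(rec[["mapQ"]])
print(rec[c("readName", "chromosome", "strand", "sequence")])
```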

type="Bowtie"

Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

Identifier: id
Strand: strand
Chromosome: chromosome
Position: position; see comment below
Read: sread; see comment below
Read quality: quality; see comments below
Similar alignments: alignData, ‘similar’ column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as the number of other instances where the same read aligns against the same reference characters as were aligned against in this alignment. Previous versions marked this as ‘Reserved’
Alignment mismatch locations: alignData, ‘mismatch’ column

NOTE: the default quality encoding changes to FastqQuality with ShortRead version 1.3.24.

This method includes the argument qualityType to specify how quality scores are encoded. Bowtie quality scores are ‘Phred’-like by default, with qualityType='FastqQuality', but can be specified as ‘Solexa’-like, with qualityType='SFastqQuality'.

Bowtie outputs positions that are 0-offset from the left-most end of the + strand. ShortRead parses position information to be 1-offset from the left-most end of the + strand.

Bowtie outputs reads aligned to the - strand as their reverse complement, and reverses the quality score string of these reads. ShortRead parses these to their original sequence and orientation.
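The two conversions described above (0-offset to 1-offset positions, and restoring the original orientation of reads on the - strand) can be sketched in base R. This only illustrates the transformations and is not ShortRead's implementation; the helper names and the example record are hypothetical.

```r
# Hypothetical helpers, for illustration only.
rev_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
rev_comp <- function(s) rev_string(chartr("ACGTacgt", "TGCAtgca", s))

# A made-up Bowtie-style record on the - strand.
bowtie_position <- 99L       # 0-offset, as reported by Bowtie
reported_seq <- "AACGT"      # reverse complement of the original read
reported_qual <- "ABCDE"     # quality string, reversed by Bowtie

# Convert to the representation described for ShortRead.
position_1_offset <- bowtie_position + 1L
original_seq <- rev_comp(reported_seq)
original_qual <- rev_string(reported_qual)
print(c(position_1_offset, original_seq, original_qual))
```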

type="SOAP"

Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

id: id
seq: sread; see comment below
qual: quality; see comment below
number of hits: alignData
a/b: alignData (pairedEnd)
length: alignData (alignedLength)
+/-: strand
chr: chromosome
location: position; see comment below
types: alignData (typeOfHit: integer portion; hitDetail: text portion)

This method includes the argument qualityType to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is ‘Solexa’-like, with qualityType='SFastqQuality', but can be specified as ‘Phred’-like, with qualityType='FastqQuality'.
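The choice of qualityType matters because the same quality string decodes to different scores under each encoding. The base R sketch below assumes the conventional ASCII offsets of 33 for ‘Phred’-like and 64 for ‘Solexa’-like encodings; the quality string is made up.

```r
# A made-up quality string, for illustration only.
qual <- "hhhh;"

# Assumption: 'Phred'-like strings use ASCII offset 33,
# 'Solexa'-like strings use ASCII offset 64.
phred_like <- utf8ToInt(qual) - 33L
solexa_like <- utf8ToInt(qual) - 64L
print(phred_like)
print(solexa_like)
```

Implausibly high scores under one offset, or unexpectedly low ones under the other, can hint that the wrong qualityType was chosen.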

SOAP outputs positions that are 1-offset from the left-most end of the + strand. ShortRead preserves this representation.

SOAP reads aligned to the - strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. ShortRead parses these to their original sequence and orientation.

A single R object (e.g., AlignedRead) containing alignments, sequences and qualities of all files in dirPath matching pattern. The order in which files are read is not guaranteed.

Martin Morgan <mtmorgan@fhcrc.org>, Simon Anders <anders@ebi.ac.uk> (MAQ map)

The AlignedRead class.

Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.

The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.

The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.

The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.

sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
(aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport"))
## PhageAlign
(aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign"))

## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
(aln2 <- readAligned(dirPath, type="MAQMapview"))

## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"),
                strandFilter("+"))
(aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))