knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

Travis-CI Build Status

SamSeq

Allows loading and working with *.sam files, i.e. the format used by SamTools. Tracks header information, reads, and the source of the file. Loads all reads and optionally the extra read tags into S3 objects that are also data frames. This allows easy manipulation and requires much less overhead than using Bioconductor, but is significantly slower and uses more memory.

Includes functions to translate the Sam flags and to filter reads.

Loading a sam file

A Sam object is created/constructed by the Sam() function, usually by passing it the name of a *.sam file as a parameter. From this object, the sam header data can be accessed with SamHeader(), the sam read data with SamRead and the source information about the sam file with SamSource(). The following code uses a tiny sam file distributed with this package.

library(SamSeq)

# Get the name of a *. sam file to use in this example. If SamSeq is not
# installed in the normal place, specify the R library directory with
# the system.file parameter lib.loc=
sample.sam <- system.file( "extData/pe.sam", package = "SamSeq" )

samObj <- Sam( sample.sam )
class(samObj)

samHeaderObj <- SamHeader( samObj )
class(samHeaderObj)
samReadsObj <- SamReads( samObj )
class(samReadsObj)
samSourceObj <- SamSource( samObj )
class(samSourceObj)

SamHeader objects

The SamHeader is not fully parsed at this point, it will probably have a different structure in future. For now it is provided as an object that can be cast to a data frame with two columns: the initial tag as the column tag, and the rest of the line as the column record.

samHeaderObj <- SamHeader( samObj )
headerDF <- as.data.frame(samHeaderObj)

names(headerDF)
headerDF[1, ]
headerDF$tag
headerDF[headerDF$tag == "SQ", "record" ]

areHeaderTagsParsed(samHeaderObj)
areHeaderTagsParsed(samHeaderObj) == areHeaderTagsParsed(samObj)

The problem yet to be determined is how to split the contents when some header lines use tabs for delimiters and others can contain embedded unquoted tabs. Probably some form of parameter object that will allow specifying, on a per tag basis, what to do.

Vectors of raw header lines can also be parsed directly into a SamHeader object. Source information can be provided, by default it will all be NA.

headerLines <- c(
    "@HD\tVN:1.4",
    "@SQ\tSN:chr7\tLN:159138663"
)
headersObj <- SamHeader( headerLines )
headerDF <- as.data.frame(samHeaderObj)
headerDF[1:2, ]

SamReads objects

The reads in a *.sam file are represented as a SamReads object, which is a data frame containing the fixed 11 columns of read data with the same name as the sam fields, plus an additional column or columns containing the optional read tags. By default the optional read tags are parsed out into columns with the tag as the column name and the value converted to integer or double if appropriate, otherwise left as character.

samReadsObj <- SamReads( samObj )
readsDF <- as.data.frame(samReadsObj)

names(readsDF)
readsDF[1:2, ]
readsDF$flag

areReadTagsParsed(samReadsObj)
areReadTagsParsed(samReadsObj) == areReadTagsParsed(samObj)

If splitTags= TRUE is set when reading in the file with Sam(), tags will be left as a single unparsed tags column with values the unparsed tags delimited by embedded tabs. You might want to leave the optional tags unparsed if you don't need them as parsing them makes reading the file 4 times slower.

rawTagsSamObj <- Sam( sample.sam, splitTags = FALSE )
rawTagsReadObj <- SamReads( rawTagsSamObj )
rawReadsDf <- as.data.frame(rawTagsReadObj)

names(rawReadsDf)
rawReadsDf[1:2, ]
rawReadsDf$tags

areReadTagsParsed(rawTagsReadObj)
areReadTagsParsed(rawTagsReadObj) == areReadTagsParsed(rawTagsSamObj)

SamSource objects

The source of the sam data is automatically bound to Sam object created when the file is read in, and propagated to the SamReads and SamHeader objects. It can be obtained by the SamSource() accessor.

samFileSourceObj <- SamSource( samObj )
samHeaderSourceObj <- SamSource( SamHeader( samObj ))
samReadsSourceObj <- SamSource( SamReads( samObj ))
identical( samFileSourceObj, samHeaderSourceObj)
identical( samFileSourceObj, samReadsSourceObj)

A SamSource object has accessors samSourceName(), samSourceHost(), and samSourceType() which return the name of the file read, the name of machine with the file, and the type of access used to get the data, respectively. Currently the only type of access is file. A possible future type might be url.

The accessors can be applied directly to the Sam, SamHeader, and SamReads objects, or indeed any object for which SamSource(x) is defined.

samSource <- SamSource( samObj )
samSourceName( samSource )
samSourceHost( samSource )
samSourceType( samSource )

The accessors are S3 functions, and can be applied directly to the Sam, SamHeader, and SamReads objects without first extracting the SamSource object.

samSourceName( samObj )
samSourceHost( samHeaderObj )
samSourceType( samReadsObj )

Sam Flags

To be detailed - see ?samFlags and ?testSamFlags

Sam Filters

To be detailed - see ?pairedReads, ?samReadFilter, ?pairedEndReadTests, and ?readTests.



jefferys/SamSeq documentation built on Oct. 25, 2019, 8:11 a.m.