Introduction

The Sequence Read Archive (SRA) is a public repository of raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT.

The SRA was developed by the National Center for Biotechnology Information (NCBI). Sequence data submitted to SRA is validated for quality and is given a permanent accession number.

Access to SRA data enhances reproducibility and provides a rich source of data for secondary analysis. The SRA Toolkit provides an API to access SRA data in a number of common file formats such as FASTQ.

SRA2R makes use of the SRA SDK and provides access to the SRA directly from R.

Usage

SRA Overview and Introspection Functions

The following dependencies are necessary for using this package: Biostrings, XVector, IRanges, S4Vectors.
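
Before loading the package, it can help to confirm that these dependencies are present. The following is a minimal base R sketch, not part of SRA2R; it assumes the four packages are installed from Bioconductor through BiocManager, which the vignette itself does not state.

```r
# Packages listed above as dependencies of SRA2R.
deps <- c("Biostrings", "XVector", "IRanges", "S4Vectors")

# requireNamespace() returns FALSE (rather than raising an error) when a package is absent.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  # Assumption: BiocManager is the installer in use for these packages.
  message("Missing packages: ", paste(missing_deps, collapse = ", "),
          "\nInstall with: BiocManager::install(c(",
          paste0('"', missing_deps, '"', collapse = ", "), "))")
} else {
  message("All dependencies are installed.")
}
```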

library(knitr)
opts_chunk$set(cache=TRUE)

Stats

The readCount2 function takes a run accession number and reports the total number of reads and the number of reads after alignment.

library(SRA2R) 
readCount2('SRR789392')

getReference gives access to all chromosome references in a run and extracts information about each one, such as its canonical and common names, length, and alignment count.

getReference('SRR789392')

To access the complete nucleotide sequence of each reference, use refBases, which returns a list with the full sequences.

rb = refBases('SRR789392')

Getting Reads

To access the reads in a read collection, you can use any of the following functions. getFastqCount simply returns the total read count.

getFastqCount('SRR789392')

getFastqReads returns a list with the sequences of the reads in a read collection. The function has two parameters: the first is the run accession number and the second is an integer that sets how many reads to return. For example, 10 returns the first 10 reads, and -1 returns all reads.

getFastqReads('SRR789392',10)
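
Once the reads are in a list, base R is enough for quick summaries. The sketch below uses a made-up list of three short reads as a stand-in for the result of getFastqReads; the real return structure may differ, so treat the variable names and the data as assumptions.

```r
# Hypothetical stand-in for getFastqReads('SRR789392', 3);
# these sequences are invented and the real structure may differ.
reads <- list("ACGTACGTAC", "GGGCCCATAT", "TTTTAACG")

# Length of each read.
read_lengths <- nchar(unlist(reads))

# GC fraction per read: count G and C characters, then divide by read length.
gc_count <- vapply(
  reads,
  function(r) sum(strsplit(r, "")[[1]] %in% c("G", "C")),
  integer(1)
)
gc_fraction <- gc_count / read_lengths

print(data.frame(length = read_lengths, gc = gc_fraction))
```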

getFastqReadsWithQuality returns a list with two elements: the first is the read sequences and the second is the quality scores. It takes the same parameters as getFastqReads.

getFastqReadsWithQuality('SRR789392',10)
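
If the quality scores come back as FASTQ-style ASCII strings, they can be decoded into numeric Phred scores in base R. This is a minimal sketch under two assumptions that the package documentation does not confirm: that qualities are returned as character strings, and that they use the Phred+33 offset. The quality string here is invented for illustration.

```r
# Hypothetical quality string; not actual output of getFastqReadsWithQuality.
qual <- "II5+!"

# Assumption: Phred+33 encoding, so subtract 33 from each ASCII code.
phred <- utf8ToInt(qual) - 33L

# Convert Phred scores to per-base error probabilities: p = 10^(-Q / 10).
error_prob <- 10^(-phred / 10)

print(phred)
print(error_prob)
```

A low mean Phred score or a run of low values at the end of a read is a common reason to trim or filter before downstream analysis.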

To retrieve the reads for a specific region, getSRAReadsWithRegion requires four parameters: the run accession number, the canonical name of the reference, and the start and end positions of the region of interest.

getSRAReadsWithRegion('SRR789392','NC_000020.10', 62926240, 62958722)
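
Regions are often written as a single string such as 'NC_000020.10:62926240-62958722'. The helper below, parse_region, is hypothetical and not part of SRA2R; it is a small sketch of how such a string could be split into the reference name, start and end that the region functions expect.

```r
# Hypothetical helper (not part of SRA2R): split 'name:start-end' into its parts.
parse_region <- function(region) {
  m <- regmatches(region, regexec("^([^:]+):([0-9]+)-([0-9]+)$", region))[[1]]
  if (length(m) != 4) {
    stop("region must look like 'name:start-end'")
  }
  start <- as.numeric(m[3])
  end <- as.numeric(m[4])
  if (start > end) {
    stop("start must not be greater than end")
  }
  list(reference = m[2], start = start, end = end)
}

region <- parse_region("NC_000020.10:62926240-62958722")
str(region)

# The parts could then be passed on, for example:
# getSRAReadsWithRegion('SRR789392', region$reference, region$start, region$end)
```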

Alignment

The function alignReadsWithRegion returns a list with information about the alignments in an SRA file. It provides access to fields such as readID, Canonical Name, position, longCigar, mappingQuality, alignmentLength and sequence. alignReadsWithRegion requires four parameters: the run accession number, the canonical name of the reference, and the start and end positions of the region of interest.

alignReadsWithRegion('SRR789392','NC_000020.10', 62926240, 62958722)
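
The longCigar field can be summarized without extra packages. The sketch below assumes that longCigar follows the SAM CIGAR convention with = for matches and X for mismatches; the CIGAR string is invented for illustration and is not output from alignReadsWithRegion.

```r
# Hypothetical CIGAR string; not actual output of alignReadsWithRegion.
cigar <- "5S20=1X10=2D14=3I7="

# Split into <length><operation> tokens, then separate lengths from operations.
tokens <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
ops <- sub("[0-9]+", "", tokens)
lens <- as.integer(sub("[MIDNSHP=X]", "", tokens))

# Operations that consume reference bases: M, D, N, = and X.
ref_span <- sum(lens[ops %in% c("M", "D", "N", "=", "X")])

# Operations that consume read bases: M, I, S, = and X.
read_len <- sum(lens[ops %in% c("M", "I", "S", "=", "X")])

print(c(ref_span = ref_span, read_len = read_len))
```

Adding ref_span to the alignment start position gives the extent of the alignment on the reference, which is useful when checking whether a read overlaps a region of interest.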

To retrieve the alignments for the full run, use the function alignReads.

ar = alignReads('SRR789392')

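Pileup

getPileUp retrieves pileup information for a region of a reference. It takes the run accession number, the reference name, and the start and end positions of the region.
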
getPileUp('SRR390728','16', 88155195, 88155207)


NCBI-Hackathons/SRA2R documentation built on May 7, 2019, 5:18 p.m.