In asrinivasan-oa/gread: Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

require(gread)
knitr::opts_chunk$set(
  comment = "#",
  error = FALSE,
  tidy = FALSE,
  cache = FALSE,
  collapse=TRUE)
# options(datatable.auto.index=FALSE)

We've developed three packages for performing differential analysis of NGS data, namely gread, gcount and ganalyse. In short,

gread enables loading or reading in the required data quickly from many different formats in which NGS data and gene annotations are available.
gcount counts the reads depending on user configuration on raw counts.
ganalyse then allows to perform differential gene expression analysis using many methods such as limma, voomlimma (for FPKM), edger on the read counts.

In this vignette, we'll discuss the gread package.

gread package

gread is an R-package that allows to read and process gene annotations and NGS data. Currently supported formats include:

Gene Feature Format or gff,
Gene Transfer Format or gtf (also referred to as gff v2.5 at times),
Browser Extensible Data or bed format, and
Binary sequence alignment/map Format or bam format

You can use the function supported_formats() to check all available formats.

supported_formats()

Reading files quickly

There are four functions, one for each type of format.

`read_gtf`

(gtf = read_gtf("sample.gtf"))
class(gtf)

`read_gff`

(gff = read_gff("sample.gff"))
class(gff)

`read_bed`

(bed = read_bed("sample.bed")) ## bed9, bed12 and bed15 formats
class(bed)

`read_bam`

(bam = read_bam("sample.bam"))
class(bam)

{.bs-callout .bs-callout-info}

All functions have a set of common arguments, and might contain additional arguments specific to that particular format. See ?<function_name> for more details.
You can filter invalid rows by using the filter argument. Default value is FALSE, i.e., load all rows.
There are some additional arguments for read_bam function to load extra tags. If you are not familiar with it, it's better to leave this argument untouched.
All these functions are objects of class gtf, gff bed and bam respectively. They inherit from data.table.

`read_format`

While these functions are quite easy and straightforward to use, there's another function read_format() that makes things even simpler and is sufficient in most cases.

# gtf file
(gtf = read_format("sample.gtf"))

# gff file
(gff = read_format("sample.gff"))

# bam file
(bam = read_format("sample.bam"))

# bed file
(bed = read_format("sample.bed"))

{.bs-callout .bs-callout-info}

The function recognises the file format automatically with proper defaults. So for most cases with default scenarios, this is quite useful.

Extracting various features

The gread() package provides an useful function extract() to extract useful features from gtf and gff objects. This vignette provides a short description of the function. See ?extract for more.

In a typical RNA-seq experiment, we would want to extract the genes start-end coordinates and also just the exons. This is because, the default methodology is to extract the reads that overlap exons and count the total number of reads that overlap only those exonic regions for that gene. Extracting genes and exons into separate data.tables makes things easier to obtain counts then using the gcount package.

gtf = read_format("sample.gtf")
## Extract a unified gene model by combining all overlapping exons
(uni_genes = extract(gtf, feature="gene_exon", type="union", 
                 transcript_id="transcript_id", gene_id="gene_id"))
## extract genes as such
(genes = extract(gtf, feature="gene", type="default"))

Other useful functionalities

Construct intron coordinates

Sometimes the gene annotation formats do not have intronic coordintes annotated but might be essential. In that case, we can use the construct_introns function on a gtf or gff object a follows:

gtf = read_format("sample.gtf")
(gtf = construct_introns(gtf, update=TRUE))
(introns = construct_introns(gtf, update=FALSE))

class(gtf)
class(introns)

Note that: {.bs-callout .bs-callout-info}

By default, the argument update is equal to TRUE, which returns an updated gtf object with the intron coordinates added to gtf object. The class of the input object is retained in result.
If you require just the introns, set update = FALSE.

asrinivasan-oa/gread documentation built on May 10, 2019, 2:04 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

asrinivasan-oa/gread
Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

In asrinivasan-oa/gread: Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

gread package

Reading files quickly

`read_gtf`

`read_gff`

`read_bed`

`read_bam`

{.bs-callout .bs-callout-info}

`read_format`

{.bs-callout .bs-callout-info}

Extracting various features

Other useful functionalities

Construct intron coordinates

Note that: {.bs-callout .bs-callout-info}

R Package Documentation

Browse R Packages

We want your feedback!

asrinivasan-oa/gread Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

In asrinivasan-oa/gread: Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

gread package

Reading files quickly

read_gtf

read_gff

read_bed

read_bam

{.bs-callout .bs-callout-info}

read_format

{.bs-callout .bs-callout-info}

Extracting various features

Other useful functionalities

Construct intron coordinates

Note that: {.bs-callout .bs-callout-info}

R Package Documentation

Browse R Packages

We want your feedback!

asrinivasan-oa/gread
Fast Reading and Processing of Common Gene Annotation and Next Generation Sequencing Format Files

`read_gtf`

`read_gff`

`read_bed`

`read_bam`

`read_format`