InPAS Vignette

suppressPackageStartupMessages({
    library(InPAS)
    library(BSgenome.Mmusculus.UCSC.mm10)
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
    library(org.Hs.eg.db)
    library(cleanUpdTSeq)
})

Introduction

Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which is prevalent in human genes. Like alternative splicing, APA can create proteome diversity. In addition, it defines 3' UTR and results in altered expression of the gene. It is a tightly controlled process and mis-regulation of APA can lead to pathological effects to the cells, such as uncontrolled cell cycle and growth. Although several high throughput sequencing methods have been developed, there are still limited data available.

RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available in public databases such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data.

The workflow for InPAS is:

How to

First, load the required libraries InPAS, species specific BSgenome, and TxDb as follows.

library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")

Next, prepare annotation using the function utr3Annotation with a TxDb and org annotation. Please note that the 3' UTR annotation prepared by utr3Annotation includes the gaps to the next annotated region.

library(org.Hs.eg.db)
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb, 
                   orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno

Users can also directly load the 3' UTR annotation included in this package for mm10 and hg19. Here we show how to load the pre-built mm10 3' UTR annotation file.

##step1 annotation
# load from dataset
data(utr3.mm10)

For coverage calculation, alignment files in BEDGraph format are required, which can be generated from BAM files using the genomecov tool in bedtools with parameter: -bg -split. Potential pA sites identified from the coverage data can be filtered/adjusted using the classifier provided by cleanUpdTseq. The following scripts illustrate the function calls needed to perform the complete analysis using InPAS.

load(file.path(path, "polyA.rds"))
library(cleanUpdTSeq)
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), 
               file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs, 
                                 tags=c("Baf3", "UM15"),
                                 genome=BSgenome.Mmusculus.UCSC.mm10,
                                 hugeData=hugeData)

## we hope the coverage rate of should be greater than 0.75 in the expressed gene.
## which is used because the demo data is a subset of genome.
coverageRate(coverage=coverage,
             txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
             genome=BSgenome.Mmusculus.UCSC.mm10,
             which = GRanges("chr6", ranges=IRanges(98013000, 140678000)))

##step2 Predict cleavage sites
CPs <- CPsites(coverage=coverage, 
               genome=BSgenome.Mmusculus.UCSC.mm10,
               utr3=utr3.mm10, 
               search_point_START=200, 
               cutEnd=.2, 
               long_coverage_threshold=3,
               background="10K",
               txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
               PolyA_PWM=pwm, 
               classifier=classifier,
               shift_range=50,
               step=10)
head(CPs)
##step3 Estimate 3UTR usage
res <- testUsage(CPsites=CPs, 
                 coverage=coverage, 
                 genome=BSgenome.Mmusculus.UCSC.mm10,
                 utr3=utr3.mm10, 
                 method="fisher.exact",
                 gp1="Baf3", gp2="UM15")

##step4 view the results
as(res, "GRanges") 

filterRes(res, gp1="Baf3", gp2="UM15",
          background_coverage_threshold=3,
          adj.P.Val_cutoff=0.05, 
          dPDUI_cutoff=0.3, 
          PDUI_logFC_cutoff=0.59)

The steps described above can be done in one function call.

if(interactive()){
    res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE,
              shift_range=50,
              PolyA_PWM=pwm, classifier=classifier,
              method="fisher.exact",
              adj.P.Val_cutoff=0.05,
              dPDUI_cutoff=0.3, 
              PDUI_logFC_cutoff=0.59)
}

InPAS can also handle single group data.

inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), 
      genome=BSgenome.Mmusculus.UCSC.mm10, 
      utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
      txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
      search_point_START=200,
      short_coverage_threshold=15,
      long_coverage_threshold=3, 
      cutStart=0, cutEnd=.2, 
      hugeData=FALSE, 
      PolyA_PWM=pwm, classifier=classifier,
      method="singleSample",
      adj.P.Val_cutoff=0.05,
      step=10)

Session Info

sessionInfo()


Try the InPAS package in your browser

Any scripts or data that you put into this service are public.

InPAS documentation built on Nov. 8, 2020, 5:03 p.m.