Introduction

Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which is prevalent in Eukaryotes. Like alternative splicing, APA can increase transcriptome diversity. In addition, it defines 3' UTR and results in altered expression of the gene. It is a tightly controlled process and mis-regulation of APA can affect many biological processed, such as uncontrolled cell cycle and growth. Although several high throughput sequencing methods have been developed, there are still limited data dedicated to identifying APA events.

However, massive RNA-seq datasets, which were originally created to quantify genome-wide gene expression, are available in public databases such as GEO and TCGA. These RNA-seq datasets also contain information of genome-wide APA. Thus, we developed the InPAS package for identifying APA from the conventional RNA-seq data.

The major procedures in InPAS workflow are as follows:

In addition, the InPAS package also provide functions to perform quality control over RNA-seq data coverage, visualize differential usage of proximal and distal CP sites for genes of interest, and prepare essential files for gene set enrichment analysis (GSEA) to reveal biological insights from genes with alternative CP sites.

How to run InPAS

First, load the required packages, including InPAS, and species-specific genome and genome annotation database: BSgenome, TxDb and EnsDb.

logger <- file(tempfile(), open = "wt")
sink(logger, type="message")
suppressPackageStartupMessages({
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(EnsDb.Hsapiens.v86)
library(EnsDb.Mmusculus.v79)
library(cleanUpdTSeq)
library(RSQLite)
library(future.apply)
})

Step 1: set up a SQLite database

Seven tables are created in the database. Table "metadata" stores the metadata, including information for tag (sample name), condition (experimental treatment group), bedgraph_file (paths to BEDGraph files), and depth (whole genome coverage depth) which is initially set to zeros and later updated during analysis. Tables "sample_coverage", "chromosome_coverage", "total_coverage", "utr3_total_coverage", "CPsites", and "utr3cds_coverage" store names of intermediate files and the chromosome and tag (sample name) relevant to the files.

plan(sequential)
data_dir <- system.file("extdata", package = "InPAS")
bedgraphs <- c(file.path(data_dir, "Baf3.extract.bedgraph"), 
               file.path(data_dir, "UM15.extract.bedgraph"))
hugeData <- FALSE
genome <- BSgenome.Mmusculus.UCSC.mm10

tags <- c("Baf3", "UM15")
metadata <- data.frame(tag = tags, 
                      condition = c("Baf3", "UM15"),
                      bedgraph_file = bedgraphs)

## In reality, don't use a temporary directory for your analysis. Instead, use a
## persistent directory to save your analysis output.
outdir = tempdir()
write.table(metadata, file = file.path(outdir =outdir, "metadata.txt"), 
            sep = "\t", quote = FALSE, row.names = FALSE)

sqlite_db <- setup_sqlitedb(metadata = file.path(outdir = outdir, 
                                                 "metadata.txt"),
                           outdir = outdir)

## check the database
db_conn <- dbConnect(drv = RSQLite::SQLite(), dbname = sqlite_db)
dbListTables(db_conn)
dbReadTable(db_conn, "metadata")
dbDisconnect(db_conn)

Step 2: Extracting 3' UTR annotation

3' UTR annotation, including start and end coordinates, and strand information of 3' UTRs, last CDS and the gaps between 3' extremities of 3' UTRs and immediate downstream exons, is extracted using the function extract_UTR3Anno from genome annotation databases: a TxDb database and an Ensembldb database for a species of interest. For demonstration, the following snippet of R scripts shows how to extract 3' UTR annotation from a abridged TxDb for a human reference genome (hg19). In reality, users should use a TxDb for the most reliable genome annotation of the PRIMARY reference genome assembly (NOT including the alternative patches) used for RNA-seq read alignment. If a TxDb is not available for the species of interest, users can build one using the function makeTxDbFromUCSC, makeTxDbFromBiomart, makeTxDbFromEnsembl, or makeTxDbFromGFF from the GenomicFeatures package, depending on the sources of the genome annotation file.

samplefile <- system.file("extdata", 
                          "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
TxDb <- loadDb(samplefile)
seqnames <- seqnames(BSgenome.Hsapiens.UCSC.hg19)

# exclude mitochondrial genome and alternative haplotypes
chr2exclude <- c("chrM", "chrMT", seqnames[grepl("_(hap\\d+|fix|alt)$", seqnames, perl = TRUE)])

# set up global variables for InPAS analysis
set_globals(genome = BSgenome.Hsapiens.UCSC.hg19,
            TxDb = TxDb,
            EnsDb = EnsDb.Hsapiens.v86,
            outdir = tempdir(),
            chr2exclude = chr2exclude,
            lockfile = tempfile())
utr3_anno <- 
  extract_UTR3Anno(sqlite_db = sqlite_db,
                   TxDb = getInPASTxDb(),
                   edb = getInPASEnsDb(),
                   genome = getInPASGenome(),
                   outdir = getInPASOutputDirectory(),
                   chr2exclude = getChr2Exclude())

head(utr3_anno$chr1)

This vignette will use the prepared 3' UTR annotation for the mouse reference genome mm10 for subsequent demonstration

## set global variables for mouse InPAS analysis
set_globals(genome = BSgenome.Mmusculus.UCSC.mm10,
            TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
            EnsDb = EnsDb.Mmusculus.v79,
            outdir = tempdir(),
            chr2exclude = "chrM",
            lockfile = tempfile())

tx <- parse_TxDb(sqlite_db = sqlite_db,
                 TxDb = getInPASTxDb(),
                 edb = getInPASEnsDb(),
                 genome = getInPASGenome(),
                 outdir = getInPASOutputDirectory(),
                 chr2exclude = getChr2Exclude())

# load R object: utr3.mm10
data(utr3.mm10)

## convert the GRanges into GRangesList for the 3' UTR annotation
utr3.mm10 <- split(utr3.mm10, seqnames(utr3.mm10), 
                   drop = TRUE)

Step 3: reformatting coverage data

Before this step, genome coverage in the BEDGraph format should be prepared from BAM files resulted from RNA-seq data alignment using the genomecov command in the BEDTools suite. BAM files can be filtered to remove multi-mapping alignments, alignments with low mapping quality and so on. Commands for reference are as follows:

```{sh eval = FALSE}

for single end RNA-seq data aligned with STAR

-q 255, unique mapping

samtools view -bu -h -q 255 /path/to/XXX.SE.bam | \ bedtools genomecov -ibam - -bga -split > XXX.SE.uniq.bedgraph

for paired-end RNA-seq data aligned with STAR

samtools view -bu -h -q 255 /path/to/XXX.PE.bam | \ bedtools genomecov -ibam - -bga -split > XXX.PE.uniq.bedgraph

The genome coverage data in the BEDGraph formatis converted into R objects of Rle-class using the `get_ssRleCov` function for each chromosome of each sample. Rle objects for each individual chromosome are save to `outdir`. The filename, tag (sample name), and chromosome name are save to Table "sample_coverage". Subsequently, chromosome-specific Rle objects for all samples are assemble together into a two-level list of Rle objects, with level 1 being the chromosome name and level 2 being Rle for each tag (sample name). Notably, the sample BEDGraph files used here only contain coverage data for "chr6" of the mouse reference genome mm10.

```r
coverage <- list()
for (i in seq_along(bedgraphs)){
coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i],
                                    tag = tags[i],
                                    genome = genome,
                                    sqlite_db = sqlite_db,
                                    outdir = outdir,
                                    chr2exclude = getChr2Exclude())
}
coverage <- assemble_allCov(sqlite_db,
                            seqname = "chr6",
                            outdir, 
                            genome = getInPASGenome())

At this point, users can check the data quality in terms of coverage for all and expressed genes and 3' UTRs using run_coverageQC. This function output summarized coverage metrics: gene.coverage.rate, expressed.gene.coverage.rate, UTR3.coverage.rate, and UTR3.expressed.gene.subset.coverage.rate. The coverage rate of quality data should be greater than 0.75 for 3' UTRs of expressed genes.

if (.Platform$OS.type == "windows")
{
  plan(multisession)
} else {
  plan(multicore)
}
run_coverageQC(sqlite_db, 
               TxDb = getInPASTxDb(), 
               edb = getInPASEnsDb(), 
               genome = getInPASGenome(),
               chr2exclude = getChr2Exclude(),
               which = GRanges("chr6",
               ranges = IRanges(98013000, 140678000)))
plan(sequential)

Step 4: Identifying potential CP sites

depth weight, Z-score cutoff thresholds, and total coverage along 3' UTRs merged across biological replicates within each condition (huge data) or individual sample (non-huge data) are returned by the setup_CPsSearch function. Potential novel CP sites are identified for each chromosome using the search_CPs function. These potential CP sites can be filtered and/or adjusted using the Naive Bayes classifier provided by cleanUpdTseq and/or by using the polyadenylation scores by simply matching the position-weight matrix (PWM) for the hexamer polyadenylation signal (AAUAAA and the like).

## load the Naive Bayes classifier model for classify CP sites from the 
## cleanUpdTseq package
data(classifier)

prepared_data <- setup_CPsSearch(sqlite_db,
                                 genome = getInPASGenome(), 
                                 chr.utr3 = utr3.mm10$chr6,
                                 seqname = "chr6",
                                 background = "10K",
                                 TxDb = getInPASTxDb(),
                                 hugeData = TRUE,
                                 outdir = outdir,
                                 silence = TRUE,
                                 coverage_threshold = 5)

CPsites <- search_CPs(seqname = "chr6",
                       sqlite_db = sqlite_db,
                       genome = getInPASGenome(),
                       MINSIZE = 10, 
                       window_size = 100,
                       search_point_START = 50,
                       search_point_END = NA,
                       cutEnd = 0,
                       filter.last = TRUE,
                       adjust_distal_polyA_end = TRUE,
                       long_coverage_threshold = 2,
                       PolyA_PWM = NA, 
                       classifier = classifier,
                       classifier_cutoff = 0.8,
                       shift_range = 50,
                       step = 5,
                       outdir = outdir, 
                       silence = TRUE)

Step 5: Estimate usage of proximal and distal CP sites

Estimate usage of proximal and distal CP sites based on read coverage along the short and long 3' UTRs

utr3_cds_cov <- get_regionCov(chr.utr3 = utr3.mm10[["chr6"]],
                              sqlite_db,
                              outdir,
                              phmm = FALSE)

eSet <- get_UTR3eSet(sqlite_db,
                     normalize ="none", 
                     singleSample = FALSE)

Step 6. identifying differential PDUI events

InPAS provides the function test_dPDUI to identify differential usage of proximal and distal CP sites between different conditions leveraging different statistical models according to the experimental design. InPAS offers statistical methods for single sample differential PDUI analysis, and single group analysis. Additionally, InPAS provides Fisher exact test for two-group unreplicated design, and empirical Bayes linear model leveraging the limma package for more complex design. The test results can be further filtered using the filter_testOut function based on the fraction samples within each condition with coverage data for the identified differential PDUI events, and/or cutoffs of nominal p-values, adjusted p-values or log2 (fold change).

test_out <- test_dPDUI(eset = eSet, 
                       method = "fisher.exact",
                       normalize = "none",
                       sqlite_db = sqlite_db)
filter_out <- filter_testOut(res = test_out,
                             gp1 = "Baf3",
                             gp2 = "UM15",
                             background_coverage_threshold = 2,
                             P.Value_cutoff = 0.05,
                             adj.P.Val_cutoff = 0.05,
                             dPDUI_cutoff = 0.3,
                             PDUI_logFC_cutoff = 0.59)

Step 7. Visualizing dPDUI events and preparing files for GSEA

InPAS package also provide functions, get_usage4plot, plot_utr3Usage, and setup_GSEA, to visualize differential usage of proximal and distal CP sites for genes of interest, and prepare essential files for gene set enrichment analysis (GSEA) to reveal biological insights from genes with alternative CP sites.

## Visualize dPDUI events                       
gr <- GRanges("chr6", IRanges(128846245, 128850081), strand = "-")
names(gr) <- "128846245-128850081"
data4plot <- get_usage4plot(gr, 
                            proximalSites = 128849130, 
                            sqlite_db,
                            hugeData = TRUE) 

plot_utr3Usage(usage_data = data4plot, 
               vline_color = "purple", 
               vline_type = "dashed")
## prepare a rank file for GSEA
setup_GSEA(eset = test_out,
           groupList= list(Baf3 = "Baf3", UM15 ="UM15"),
           outdir = outdir,
           preranked = TRUE,
           rankBy = "logFC",
           rnkFilename = "InPAS.rnk")

Session Info

sessionInfo()
sink(type="message")
close(logger)


jianhong/InPAS documentation built on Oct. 27, 2023, 2:13 p.m.