knitr::opts_chunk$set( fig.width = 6, fig.height = 5.5, collapse = TRUE, warning = FALSE, comment = "#>" )
This documentation describes how to read an external file of single-cell poly(A) sites generated by scAPAtrap and analyze it with movAPA.
Actually, PAC list with columns chr/strand/coord and counts can be easily converted as PACdataset
and loaded into movAPA by movAPA::createPACdataset
or readPACds
.
In this document, "PA, PAC, or pA" all are short for a poly(A) site.
movAPA implemented the PACdataset object for storing the expression levels and annotation of PACs from various conditions/samples. Almost all analyses of poly(A) site data in movAPA are based on the PACdataset. The "counts" matrix is the first element in the array list of PACdataset, which stores non-negative values representing expression levels of PACs. The "colData" matrix records the sample information and the "anno" matrix stores the genome annotation or additional information of the poly(A) site data.
The demo_peaks dataset in the movAPA package contains 5w peaks in 2538 cells, including two variables peaks.meta
and peaks.demo
. This file was obtained by scAPAtrap.
library(movAPA) # load demo peaks generated by scAPAtrap, # which contains a list(peaks.meta, peaks.count) data("demo_peaks") peaks=demo_peaks names(peaks) head(peaks$peaks.meta) head(peaks$peaks.count[, 1:10])
Then we can create a PACdataset object.
PACds=createPACdataset(counts=peaks$peaks.count, anno=peaks$peaks.meta) PACds rm(peaks)
This dataset contains many PAs (also called peaks in scAPAtrap), which may not be suitable for downstream analysis. We can first remove extremely lowly expressed peaks.
First, we make summary of the PACdataset. It seems that >50% of PAs with less than 2 reads.
summary(PACds)
Here we remove PAs with \<10 tags supported by all cells and PAs that are expressed in \<10 cells. This filtering removed \~80% PAs.
PACds=subsetPACds(PACds, totPACtag = 10, minExprConds = 10, verbose=TRUE) summary(PACds)
After read the data into a PACdataset, users can use many functions in movAPA for removing internal priming artifacts, annotating PACs, polyA signal analysis, etc. Please follow the vignette of "movAPA_on_rice_tissues" for more details.
For example, users can remove internal priming artifacts with removePACdsIP
. Before starting, it is better to check the chromosome names are consistent between the BSgenome
and PACdataset
. We need make sure the chromosome name of your PAC data is the same as the BSgenome.
library(BSgenome.Hsapiens.UCSC.hg38, quietly = TRUE, verbose = FALSE) bsgenome<-BSgenome.Hsapiens.UCSC.hg38 head(GenomeInfoDb::seqnames(bsgenome)) head(unique(PACds@anno$chr))
PACdsIP=removePACdsIP(PACds, bsgenome, returnBoth=TRUE, up=-10, dn=10, conA=6, sepA=7) length(PACdsIP$real) length(PACdsIP$ip) # summary of IPs and real PAs summary(PACdsIP$real) summary(PACdsIP$ip)
Nearly half of the PACs are internal priming artifacts. We can use the real PAs for further analysis. However, removing internal priming is a non-trifle task, which should be done in caution.
# use real PA for further analysis PACds=PACdsIP$real
Another way to remove internal priming (IP) artifacts is to use known PAs. We can retain those IP sites that supported by known PAs as real. The following example shows how to use known human PAs from PolyA_DB v3 for IP removing.
First we loaded the known PAs.
annodb=read.table("Human_hg38.PolyA_DB.bed", sep="\t", header = FALSE) head(annodb) annodb=annodb[,c(1,2,6)] colnames(annodb)=c("chr","coord","strand") annodb=readPACds(annodb, colDataFile = NULL)
Next, we determine the overlap between PACds' IP and known PAs. The PAs of scAPAtrap is represented by peak, with peak start and end. We believe that if a peak overlaps with any reference PA, then the PA of that peak is considered real.
The findOverapPACds
of movAPA use UPA_Start
and UPA_end
as the column names, so here we first modify the start and end column names of peak to UPA_Start and UPA_end.
PACdsIP$ip@anno$UPA_start=PACdsIP$ip@anno$start PACdsIP$ip@anno$UPA_end=PACdsIP$ip@anno$end PACdsIP$real@anno$UPA_start=PACdsIP$real@anno$start PACdsIP$real@anno$UPA_end=PACdsIP$real@anno$end ## find overlapping IPs IP.ovp=findOvpPACds(qryPACds=PACdsIP$ip, sbjPACds=annodb, d=50, qryMode='region', sbjMode='point', returnNonOvp=TRUE)
Merge IPs supported by reference PAs and the original real PAs as the PAs for subsequent analysis.
anno=rbind(PACdsIP$real@anno, IP.ovp$ovp@anno) count=rbind(PACdsIP$real@counts, IP.ovp$ovp@counts) PACds=createPACdataset(counts=count, anno=anno) summary(PACds)
library(TxDb.Hsapiens.UCSC.hg38.knownGene, quietly = TRUE, verbose = FALSE) txdb=TxDb.Hsapiens.UCSC.hg38.knownGene # annotate PAs PACds=annotatePAC(PACds, txdb) # after annotation, the gene and ftr information are present in PACds@anno. summary(PACds)
Genes with or without annotated 3'UTR could be assigned an extended 3'UTR of a given length using the function ext3UTRPACds
, which can improve the "recovery" of poly(A) sites falling within authentic 3'UTRs.
Before extending, we can calculate the number of PACs falling into extended 3'UTRs of different lengths.
testExt3UTR(PACds, seq(1000, 10000, by=1000))
You can take a look at the length of the 3'UTRs PACds again to make a rough judgment. Here, we only select the length of the 3'UTRs where PA is located for approximate calculation.
utrid=which(PACds@anno$ftr=='3UTR') utrs=unique(PACds@anno[utrid, c('ftr_end','ftr_start')]) summary(utrs$ftr_end-utrs$ftr_start+1)
Here we extended 3'UTR length for 2000 bp. After extension, 100+ PACs in intergenic region are now in extended 3'UTRs.
# extend 3UTR by 1000bp PACds=ext3UTRPACds(PACds, 1000) # after 3UTR extension summary(PACds)
For single cell data, we suggest analyzing only 3'UTR PAs and discarding PAs from other regions.
PACds=PACds[PACds@anno$ftr=='3UTR'] summary(PACds)
The function plotATCGforFAfile is for plotting single nucleotide profiles for given fasta file(s), which is particularly useful for discerning base compositions surrounding PACs.
First trim sequences surrounding PACs. Sequences surrounding PACs in different genomic regions are extracted into files. The PAC position is 301.
faFiles=faFromPACds(PACds, bsgenome, what='updn', fapre='updn', up=-300, dn=100, byGrp='ftr')
Then plot base compositions for specific sequence file(s).
plotATCGforFAfile(faFiles, ofreq=FALSE, opdf=FALSE, refPos=301, mergePlots = TRUE)
movAPA provides several functions, including annotateByPAS, faFromPACds, kcount, and plotATCGforFAfile, for sequence extraction and poly(A) signal identification.
Here we show another example to scan known human polyA signals. First, we get mouse signals and set the priority.
v=getVarGrams('mm') priority=c(1,2,rep(3, length(v)-2))
Then scan upstream regions of PACs for mouse signals.
PACdsMM=annotateByPAS(PACds, bsgenome, grams=v, priority=priority, from=-50, to=-1, label='mm') table(PACdsMM@anno$mm_gram) ## percent of PA without PAS noPAS=sum(is.na(PACdsMM@anno$mm_gram)) noPAS/length(PACdsMM)
Plot signal logos.
pas=PACdsMM@anno$mm_gram[!is.na(PACdsMM@anno$mm_gram)] plotSeqLogo(pas)
The function mergePACds can also be used to merge multiple PACdatasets. Notably, the annotation columns (e.g., gene, ftr) are lost after merging, you need call annotatePAC to annotate the merged PACds.
Here we show a multi-sample merging using the reference PA. We directly copy one more PACds for this example.
## the two PACds for merging PACdsList=list(ds1=PACds, ds2=PACds)
pacds.merge=mergePACds(PACdsList, d=24, by='coord') # summary of PACds#1 summary(PACds) # summary of the merged PACds summary(pacds.merge) head(pacds.merge@counts[, 1:10]) head(pacds.merge@anno)
In movAPA 0.2.0, a reference PACds can be used for merging PACdsList in a smarter way. Providing reference PACds for merging is useful when there are multiple large PAC lists to be merged, which can prevent generating PACs with a very wide range. If there is reference PACs from 3'seq, it is recommended to use it. Please see the help document of mergePACds for details.
Given a reference PACds, buildRefPACdsAnno
can be used to combine multiple PACds to build a more complete reference.
First we loaded the known PAs.
annodb=read.table("./Human_hg38.PolyA_DB.bed", sep="\t", header = FALSE) head(annodb) annodb=annodb[,c(1,2,6)] colnames(annodb)=c("chr","coord","strand") annodb=readPACds(annodb, colDataFile = NULL)
## we can build a reference PACds with human known PA and the given PACds list ## or we can use annodb only refPA=buildRefPACdsAnno(refPACds=annodb, PACdsList=PACdsList, by='coord', d=24, min.counts = 50, min.smps=10, max.width=500, verbose=TRUE)
Because the two PACds merged for this example are the same, the number of PACs remains unchanged after merging, but the sample size doubles.
pacds.merge=mergePACds(PACdsList, d=24, by='coord', refPACds=refPA) # summary of PACds#1 summary(PACds) # summary of the merged PACds summary(pacds.merge) head(pacds.merge@counts[, 1:10]) head(pacds.merge@anno)
The original annotation information of the merged data will be removed and needs to be re-annotated.
# annotate PAs pacds.merge=annotatePAC(pacds.merge, txdb) # after annotation, the gene and ftr information are present in PACds@anno. summary(pacds.merge)
Get APA index using the smart RUD method (available in movAPA v0.2.0).
The smartRUD indicator is provided in movAPA v0.2.0, and it is recommended to use it. Pay attention to setting clearPAT=1 to remove cases of PAs with only 1 read count. At the same time, check the distance between the two PAs first to select a suitable dist for filtering the proximal and distal PAs of 3'UTR.
s=getNearbyPAdist(PACds)
You can see that the average distance is 10K nt, but the median distance is only 3000 nt, so setting 5000 nt is probably enough.
# get proximal and distal PA for each gene pd=get3UTRAPApd(pacds=PACds, minDist=50, maxDist=5000, minRatio=0.05, fixDistal=FALSE, addCols='pd') # get smartRUD index rud=movAPAindex(pd, method="smartRUD", sRUD.oweight=TRUE, clearPAT=1) head(rud$rud[, 1:5]) head(rud$weight[, 1:5]) ## we can also calculate the RUD index ## rud1=movAPAindex(pd, method="RUD")
The vizAPA package is a comprehensive package for visualization of APA dynamics in single cells. vizAPA imports various types of APA data and genome annotation sources through unified data structures. vizAPA also enables identification of genes with differential APA usage (also called APA markers). Four unique modules are provided in vizAPA for visualizing APA dynamics across cell groups and at the single-cell level.
Here we use vizAPA to show the gene structure and PA location.
library(vizAPA, quietly = TRUE) # to choose one gene, first we get the total counts of each PA PACds@anno$TN=rowSums(PACds@counts) pd@anno$TN=rowSums(pd@counts) # construct the genome annotation object annoSource=new("annoHub") annoSource=addAnno(annoSource, txdb) gene=pd@anno$gene[1] vizTracks(gene=gene, PACds.list=list(pA=PACds, pd=pd), PA.show=c("pos","TN"), annoSource=annoSource, PA.columns="coord", PA.width=10, space5=1000, space3=1000)
The gene IDs between different annotations may not be consistent, and we can also use vizAPA to map different IDs.
library(Homo.sapiens) orgdb=Homo.sapiens genes=getAnnoGenes(orgdb) head(genes) genes[genes$gene_entrezid==gene, ] genes[genes$gene_entrezid==10003, ]
The session information records the versions of all the packages used in the generation of the present document.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.