1. Overview

    Here, we will use the PACBIO sequencing data from the mouse GV oocytes cell line published by Yusheng Liu et al. to demonstrate the key functions of PolyAtailor. This includes tail quantification without reference genome, reference genome-based tail quantification, PA site identification and annotation, analysis of base composition in Poly(A) tails and analysis of the significance of differences in Poly(A) tail length under different conditions.

  2. Preparation

    First, the GV rep1 data (SRR8798075) was downloaded from the NCBI database. The fastq format file contained a total of 63,732 reads, which are all ccs-processed reads and can be used directly for analysis. The tool for sequence alignment is minimap2 (with default parameters) and the reference genome is GRCm38(mm10). Once you have done this you can start using PolyAtailor:

# require(devtools)
# install_github("XHWUlab/PolyAtailor")
  1. Quantitative poly(A) tail with tailScan

    All analyses will start with quantitativing the poly(A) tail. The input to this step is a fastq format file containing reads that have been processed by ccs, and the output will be a dataframe containing all the sequence tail information, Include at least read_num,tail,PAL,chr,strand,tailType,read_type and sample. The median length of all tails found by tailScan is 57nt, which is not far from the 59nt reported in the original article.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = F,
  fig.width = 6, 
  fig.height = 6,
  eval=FALSE
)
library(polyAtailor)
fastqfile <- system.file("extdata", "GV_fastq/PAIso_GV1.zip", package = "PolyAtailor", mustWork = TRUE)
unzip(fastqfile, exdir="D:/")
fastqfile <- "D:/PAIso_GV1.fastq"
library(ShortRead)
GV1tailDF<-tailScan(fastqfile,mcans=5,findUmi = F,resultpath = "./data/output/",samplename = "GV1",tailAnchorLen=8,minTailLen=8,realTailLen=20,maxNtail=2,mapping=F)
head(GV1tailDF)
#>read_num strand PAL tail tailType read_type nA rt
#SRR8798075.11991 - 9  TTTTTTTTA     structural two-tail-mixed 8  0.89 GV1
#SRR8798075.11991 + 24 TTTTT...ATTGT structural two-tail-mixed 19 0.79 GV1
#SRR8798075.12051 - 13 TTTTTTTTTTTTT structural two-tail-mixed 13 1.00 GV1
#SRR8798075.12051 + 57 TTTTT...TTTTG structural two-tail-mixed 56 0.98 GV1
#SRR8798075.12545 - 28 TTTTT...TGCTT structural two-tail-mixed 21 0.75 GV1
#SRR8798075.12545 + 15 TTTTT...GATCT structural two-tail-mixed 12 0.80 GV1
median(GV1tailDF$PAL)
# [1] 57
  1. Quantitative poly(A) tail with tailMap

All analyses will start with quantitativing the poly(A) tail. The input to this tailMap step is a bam file after alignment, and the output will be a dataframe containing all the sequence tail information, Include at least read_num,tail,PAL,chr,coord,strand,tailType,read_type and sample.There are four steps in the tail extraction using the alignment method. The median length of all tails found by tailScan is 59nt.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = F,
  fig.width = 6, 
  fig.height = 6,
  eval=FALSE
)
#step1
fastqfile <- system.file("extdata", "GV_fastq/PAIso_GV1.zip", package = "PolyAtailor", mustWork = TRUE)
unzip(fastqfile, exdir="D:/")
fastqfile <- "D:/PAIso_GV1.fastq"
library(seqRFLP)
faBuilderRE <- faBuilder(fastqfile,mcans=5,findUmi = F,resultpath = "./data/output/",samplename = "GV1",tailAnchorLen=8,mapping=F)

#step2
  ##alignment with any aligner.

#step3
bamfile <- system.file("extdata", "./GV_algin/GV1subseq.sorted.bam", package = "PolyAtailor", mustWork = TRUE)
GV1tailMapre<-tailMap(bamfile,mcans=5,minTailLen=8,findUmi = F,longRead=T)
head(GV1tailMapre)
# read_num                   chr strand coord PAL
# 1  SRR8798075.1004  ENSMUST00000009039.6      -   326  69
# 2 SRR8798075.10180 ENSMUST00000100052.11      -  3273   8
# 3 SRR8798075.10187  ENSMUST00000141115.8      +  9913  10
# 4 SRR8798075.10190  ENSMUST00000139787.8      -   296  40
# 5 SRR8798075.10378 ENSMUST00000055131.13      -  3224  63
# 6 SRR8798075.10430  ENSMUST00000209034.2      -  1661  57
#                                                                    tail
# 1 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
# 2                                                              TTTTTTTT
# 3                                                            TTTTTTTTTT
# 4                              TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
# 5       TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATT
# 6             TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
#     tailType read_type nA       rt sample
# 1 structural  one-tail 69 1.000000    GV1
# 2 structural  one-tail  8 1.000000    GV1
# 3 structural  one-tail 10 1.000000    GV1
# 4 structural  one-tail 40 1.000000    GV1
# 5 structural  one-tail 62 0.984127    GV1
# 6 structural  one-tail 57 1.000000    GV1

#step4
BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
data(GV1tailDF)
data(GV1tailMapre)
AnnotedTails = geneAnno(tailDF=GV1tailDF,bamdf=GV1tailMapre,GFF=TxDb.Mmusculus.UCSC.mm10.knownGene,longRead=F)
head(AnnotedTails)
# read_num   chr strand  gene gene_type PAL
# 1 SRR8798075.10543  chr3      - 72007      <NA>  11
# 2 SRR8798075.10543  chr3      - 72007      <NA>  17
# 3 SRR8798075.11063 chr10      + 69412      <NA>  89
# 4 SRR8798075.13407 chr14      - 12159      <NA>  89
# 5 SRR8798075.14401 chr14      - 12159      <NA>  43
# 6 SRR8798075.14401 chr14      - 12159      <NA> 119
# tail
# 1 TTTTTTTTTTT
# 2 TTTTTTTTTTTTTTTTT
# 3 TTTTTTT...TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
# 4 TTTTTTT...TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
# 5 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCT
# 6 TTTTTTT...TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTGCTTTGCTTTACTTTC
#       tailType      read_type  nA        rt sample
# 1   structural two-tail-mixed  11 1.0000000    GV1
# 2   structural two-tail-mixed  17 1.0000000    GV1
# 3   structural       one-tail  89 1.0000000    GV1
# 4   structural       one-tail  89 1.0000000    GV1
# 5   structural  two-tail-same  41 0.9534884    GV1
# 6 unstructural  two-tail-same 111 0.9327731    GV1
  1. PA site detection and annotation

    In order to be able to analyze the APA sites of HELA later, we need to identify the PA sites in the sequence and annotate the PA sites we have identified using human annotation information. At the same time, this step will map the base distribution around the PA site and store it as a PDF file in the path you specify.

[NOTE] Please note that the chromosome name of the reference genome should be the same as the chromosome name used in the BAM file, otherwise it will result in an error!

# Deciphering gff files
library(movAPA)
BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
# Deciphering genome files
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
library("BSgenome.Mmusculus.UCSC.mm10")
bsgenome = BSgenome.Mmusculus.UCSC.mm10
# Prepare the input file
bamfilepath = system.file("extdata", "./GV_algin/PAIso-GV1.sorted.bam", package = "PolyAtailor", mustWork = TRUE)
chrinfopath = system.file("extdata", "./GV_algin/chrinfo.txt", package = "PolyAtailor", mustWork = TRUE)
resultpath = "D:/TEST/"
# Annotated PA site
PAs <- findAndAnnoPAs(bamfile=bamfilepath,chrinfo=chrinfopath,resultpath=resultpath,bsgenome=bsgenome,gffFile = TxDb.Mmusculus.UCSC.mm10.knownGene,sample="GV1",mergePAs=T,d=24)

MarineGEO circle logo

  1. distribution of poly(A) tail length

    Using the PoltPAL family of functions, the tail length distribution of the gene or UMI of different sample will be plotted, as well as the global tail length distribution.
    The function plotPALDistribution will plot the global distribution of tail length under different samples, when the parameter dType is equal to "global". When the parameter "dType" is "gene", this function plots the PAL per gene distribution density curve, and when the parameter is "umi", it plots the PAL distribution density curve of PAL per umi.

library(ggthemes)
library(eoffice)
data(AnnotedTails)
p1 <- plotPALDistribution(AnnotedTails,"./data/figures/","global",medianPAL=T)
p1
p2 <- plotPALDistribution(AnnotedTails,"./data/figures/","gene",medianPAL=T)
p2


  1. PA sites analysis

The statistics of PA sites include three aspects:

data(PAs)
p <- plotPADistribution(PAs,"./data/figures/","#9BBFDC")
p


data(PAs)
p <- plotGenePAnumbers(PAs,"./data/figures/","#DF7C7D")
p


library("BSgenome.Mmusculus.UCSC.mm10")
bsgenome = BSgenome.Mmusculus.UCSC.mm10
data(PAs)
p <- plotPASignals(PAs,"./data/figures/",bsgenome = bsgenome)
p


  1. Base composition analysis

    Polyatailor contains abundant analysis and visualization tools of poly(A) tail base composition, including statistics of the number distribution of reads in different non-A base combinations, analysis of non-A base content in polyAtail of different lengths, and direct visual comparative analysis of intrested tails. All you need to do is enter a table that contains at least the genetic information and the complete tail sequence and all results will be saved in a PDF file. Of course, you can do each analysis separately.

  2. Frequency of occurrence of different non-A base combinations
    The number of reads present in each non-A base combination was calculated in the form of an upset plot, and reads present in multiple non-A base combinations were highlighted, The result can be called directly in the output of the nonAanalysis function. For example, in the tails of GV data, most of the tails contain only one particular non-A base, but there are also a number of tails in which all three non-A bases are present at the same time:

library(stringi)
data(AnnotedTails)
re <- nonAanalysis(AnnotedTails)
re$p1


re$p2
re$p3


library(ggmsa)
library(seqRFLP)
data(taildf)
my_cutstom <- data.frame(names=c("A","C","T","G"),color=c("#3171A5","#4EAA4C","#C9C4C2","#D73D3D"))
p <- tailViso(taildf,tailLen=100,Ntail=20,custom=my_cutstom,strand="-",faPath="D:/",showLogo=T,showReadNum= F)


  1. Detection of PAL difference significance

    The "PALdsa" function of PolyAtailor package can be used to detect the significant difference of tail length in batch at gene or PA level. We offer four ways to perform the significance of differences in tail length, namely the Mann-Whitney U test, the Kolmogorov-Smimov test, MosesExtreme reaction and the two-sided wilcoxon test, user can select one or all of these for calculation using the parameter "SAoDMethod". In addition to this, the user can also select the visualisation method using the parameters "withViolinPlot" and "withUpsetPlot". However, in the GV data, we only detected significant differences in PAL at the distal and proximal PA sites for 1 gene.

BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
gff <- parseGenomeAnnotation(TxDb.Mmusculus.UCSC.mm10.knownGene)
data(AnnotedTails)
files = system.file("extdata", "./output/PAs/PAs.txt", package = "PolyAtailor", mustWork = TRUE)
PAs <- read.table(files,header=TRUE,sep=" ")
diffPAL2PAgenes <- PALdsa(PAs,AnnotedTails,gff,mode="PD",SAoDMethod="ME",withViolinPlot=T,withUpsetPlot=F)



References
Ye W, Liu T, Fu H, et al. movAPA: Modeling and visualization of dynamics of alternative polyadenylation across biological samples[J]. Bioinformatics, 2021, 37(16): 2470-2472.
Liu Y, Nie H, Liu H, et al. Poly (A) inclusive RNA isoform sequencing (PAIso− seq) reveals wide-spread non-adenosine residues within RNA poly (A) tails[J]. Nature communications, 2019, 10(1): 1-13.
Legnini I, Alles J, Karaiskos N, et al. FLAM-seq: full-length mRNA sequencing reveals principles of poly (A) tail length control[J]. Nature methods, 2019, 16(9): 879-886.



XHWUlab/PolyAtailor documentation built on Dec. 28, 2021, 12:13 a.m.