Zhaolian Lu, Keenan Berry, Zhenbin Hu, Yu Zhan, Tae-Hyuk Ahn, Zhenguo Lin
TSSr was published in NAR Genomics and Bioinformatics in 2021, please cite:
Documentation is also available on GitHub Pages: https://github.com/Linlab-slu/TSSr
TSSr is an R package that provides rich functions for mapping transcription start sites (TSSs) and characterizations of structures and activities of core promoters based on various types of sequencing data generated by high-throughput techniques that sequence the very 5’end of RNA transcripts (TSS sequencing), such as cap analysis of gene expression (CAGE),nAnT-iCAGE, NanoCAGE, transcript leader sequencing (TL-seq), transcript isoform sequencing (TIF-seq), TSS-seq, RAMPAGE, single-cell tagged reverse transcription (STRT), global nuclear run-on cap (GRO-cap) and STRIPE-seq.
The main fucntions of TSSr include identifications of TSSs, normalization and filtration of TSS signals, clustering of TSS to infer core promoters, quantifications of core promoter activities and shape, assigning TSS cluster to genes, gene differential expression, core promoter shift. TSSr uses multiple formats of files as input, such as Binary Sequence Alignment Mao (BAM) files (single-ended or paired-ended), Browser Extension Data (bed) files, BigWig files, or TSS tables. TSSr generates various types of TSS or core promoter track files which can be visualized in the UCSC Genome Browser or Integrative Genomics Viewer (IGV). TSSr also exports various raw and processed result tables and visualization plots. Multiple cores are supported on Linux or Mac OS platforms.
install the packages by using the following R command:
R
install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsamtools")
BiocManager::install("GenomicRanges")
BiocManager::install("GenomicFeatures")
BiocManager::install("Gviz")
BiocManager::install("rtracklayer")
BiocManager::install("DESeq2")
BiocManager::install("BSgenome")
R
install.packages("data.table")
install.packages("stringr")
conda create -n tssr
conda activate tssr
conda install r-base=4.1.0
conda install -c bioconda -c conda-forge bioconductor-rsamtools bioconductor-gviz \
bioconductor-genomicranges bioconductor-genomicfeatures bioconductor-rtracklayer \
bioconductor-deseq2 bioconductor-bsgenome r-data.table r-stringr r-devtools pandoc
If you running into some errors when trying install all the packages in one line, please install the packages one by one manually.
You can also install the dependency packages from yml file
wget -c https://raw.githubusercontent.com/Linlab-slu/TSSr/master/tssr.yml
conda create -n tssr -f tssr.yml
To install the TSSr package all the prerequisites above need to be installed. After confirming those packages are installed, you can install the development version directly from GitHub using devtools:
devtools::install_github("Linlab-slu/TSSr", build_vignettes = TRUE)
TSSr accepts two types of input data: read alignment files or TSS tables. The read alignment files in compressed binary alignment map (BAM) format are required if users intend to identify TSSs from mapped sequencing data. BAM files can be derived from mapping of either paired-end or single-end TSS sequencing reads. A TSS table is a matrix of TSS data generated by TSSr or other bioinformatics tools. A TSS table can be a tab-delimited text file, a BigWig (bw) binary type file, or browser extensible data (bed) type file. A tab-delimited TSS table contains chromosome ID, genomic coordinates, strand information, and raw or normalized read counts of each sample. An example TSS table can be downlaoded from http://zlinlab.org/TSSr/ALL.samples.TSS.raw.txt.
The reference genome stored as a BSgenome data package must be provided for identification of TSSs from bam files. For available BSgenome genome objects, please check (https://bioconductor.org/packages/BSgenome). In case the BSgenome object of a species is unavailable in the BSgenome package (not in the list returned by BSgenome::available.genomes() function), a custom BSgenome object has to be built and installed before running this package. (See the vignette “How to forge a BSgenome data package” within the BSgenome package for instructions). A genome annotation file (GTF or GFF file) is required for assigning TSS clusters to genes or other genomic features by using the annotateCluster function of TSSr. The chromosome names in BSgenome, genome annotation file as well as the reference genome used for mapping sequencing reads must be consistent.
TSSr uses S4 object to store all input files and arguments and generate downstream analysis results.
Launch TSSr
library(TSSr)
Example TSSr object. Users may explore the structure of a TSSr object and TSSr fucntions using provided example TSSr object "exampleTSSr". Skip this step if the user plan to generate a new TSSr object from bam or TSS table files.
data(exampleTSSr)
myTSSr <- exampleTSSr
Generating a new TSSr object from BAM or TSS table files. To generate a new TSSr object, several arguments of the constructer "new" function must be specified. The first step is to provide the path and file names of input files for argument "inputFiles". Four example BAM files (S01.sorted.bam, S02.sorted.bam, S03.sorted.bam, S04.sorted.bam) can be downloaded from http://www.zlinlab.org/TSSr.html. These BAM files were generated using nAnT-iCAGE from a budding yeast Saccharomyces cerevisiae (Lu and Lin 2019). The nAnT-iCAGE reads were mapped to the reference genome of S. cerevisiae (R64-2-1) using HISAT2 (Kim, Langmead et al. 2015). To reduced sample file size, each BAM file only includes sequencing reads mapped to two chromosomes (Chr I and Chr II). A TSS table genereated from the four BAM files can be downloaded from http://zlinlab.org/TSSr/ALL.samples.TSS.raw.txt, which can also be used as a input file. Assuming the bam or TSS files are saved to the current working directory:
inputFiles <- c("S01.sorted.bam", "S02.sorted.bam", "S03.sorted.bam", "S04.sorted.bam") # if use a TSS table: inputFiles <- "ALL.samples.TSS.raw.txt"
inputFilesType <- "bam" # set “inputFilesType” as “bamPairedEnd” for paired-end BAM files, and as "TSStable" if the input file is a TSS table
Install and launch required BSgenome object. The corresponding BSgenome object of S. cerevisiae “BSgenome.Scerevisiae.UCSC.sacCer3” is available from BSgenome Bioconductor package (https://bioconductor.org/packages/BSgenome).
if (!requireNamespace("BSgenome.Scerevisiae.UCSC.sacCer3", quietly = TRUE))
BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3")
library(BSgenome.Scerevisiae.UCSC.sacCer3)
Provide path and file name of genome annotation file. The example genome annotation of S. cerevisiae (R64-2-1) was originally obtained from the Saccharomyces Genome Database. It ban be downlaoded from http://zlinlab.org/TSSr/saccharomyces_cerevisiae_R64-2-1.gff. Assuming the gff file is saved to the current working directory:
refSource <- "saccharomyces_cerevisiae_R64-2-1.gff"
Creating a new TSSr object using the constructer "new" function. In addition to the input data file "inputFiles" and file type "inputFilesType", BSgenome object name, and genome annotaion file "refSource", several other arguments need to be specified, including "sampleLabels", "sampleLabelsMerged", and "mergeIndex" and "organismName". In the example data, sample lables "SL01, SL02, SL03 and SL04" were used for "S01.sorted.bam, S02.sorted.bam, S03.sorted.bam, S04.sorted.bam", respectively (sampleLabels = c("SL01","SL02","SL03","SL04")). If a TSS table is used as input data, the sample lable of each sample is its column name in the TSS table. "sampleLabelsMerged" will be used if the user need to merge TSS signals from multiple samples (e.g., biological replicates) into a single dataset. In this case, SL01 and SL02 are two biological replicates obtained from S. cerevisiae cells grown in rich medium ("control"), while SL03 and SL04 were obtained by treating S. cerevisiae with α factor ("treat"). Thus, sampleLabelsMerged = c("control","treat") and mergeIndex = c(1,1,2,2), which means SL01 and SL02 will be merged as group 1 "control", and SL03 and SL04 will be merged as group 2 "treat". "organismName" is the species name of the samples, which will be used to annotate TSS clusters by "annotateCluster".
myTSSr <- new("TSSr", genomeName = "BSgenome.Scerevisiae.UCSC.sacCer3"
,inputFiles = inputFiles
,inputFilesType = inputFilesType
,sampleLabels = c("SL01","SL02","SL03","SL04")
,sampleLabelsMerged = c("control","treat")
,mergeIndex = c(1,1,2,2)
,refSource = refSource
,organismName = "Saccharomyces cerevisiae")
To display the available slots in the created TSSr object by typing the TSSr object name. Most slots in the newly created TSSr object are empty at this point since no analysis has been conducted.:
myTSSr
# An object of class "TSSr"
# Slot "genomeName":
# [1] "BSgenome.Scerevisiae.UCSC.sacCer3"
#
# Slot "inputFiles":
# [1] "S01.sorted.bam" "S02.sorted.bam" "S03.sorted.bam" "S04.sorted.bam"
#
# Slot "inputFilesType":
# [1] "bam"
#
# Slot "sampleLabels":
# [1] "SL01" "SL02" "SL03" "SL04"
#
# Slot "sampleLabelsMerged":
# [1] "control" "treat"
#
# Slot "librarySizes":
# numeric(0)
#
# Slot "TSSrawMatrix":
# data frame with 0 columns and 0 rows
#
# Slot "mergeIndex":
# [1] 1 1 2 2
#
# Slot "TSSprocessedMatrix":
# data frame with 0 columns and 0 rows
#
# Slot "tagClusters":
# list()
#
# Slot "consensusClusters":
# list()
#
# Slot "clusterShape":
# list()
#
# Slot "refSource":
# [1] "saccharomyces_cerevisiae_R64-2-1.gff"
#
# Slot "refTable":
# data frame with 0 columns and 0 rows
#
# Slot "organismName":
# [1] "Saccharomyces cerevisiae"
#
# Slot "assignedClusters":
# list()
#
# Slot "unassignedClusters":
# list()
#
# Slot "filteredClusters":
# list()
#
# Slot "DEtables":
# list()
#
# Slot "TAGtables":
# list()
#
# Slot "PromoterShift":
# list()
TSS calling from bam files or retrieving TSS data from TSS table using "getTSS" function. The "getTSS" function identifies genomic coordinates of all TSSs and calculates read counts supporting each TSS from bam file and return values to the slot of TSSrawMatrix. Before TSS calling, TSSr removes reads that are below certain sequencing quality and mapping quality. The default threshold for Phred quality score is 10, and mapping quality (MAPQ score) is 20. Users may change these arguments by setting different values for “sequencingQualityThreshold” and “mappingQualityThreshold” when running the “getTSS” function. If a mapped TSS sequencing read starts with a G that is a mismatch to the reference genome, the uncoded 5’ end G is likely the m7G cap, and thus the uncoded G will be removed from TSS calling by "getTSS". If a matched G at the 5’end of a tag is considered as an added cap, TSSr treats the 5’end of reads with matched G as genome-coded G, and the first G is not removed when calling TSS. This strategy is based on a stronge preference of PyPu dinucleotide at the [-1, +1] sites. This strategy also makes TSSr suitable for calling TSSs from 5’end sequencing reads that are not based on cap capture techniques.
Considering that soft-clipping during mapping of sequencing reads to reference genome could also exclude non-matching 5’end bases other than unmatched Gs, which introduces false positive TSSs. Therefore, we’d recommend not to use soft-clipping for sequencing data. In the case that soft-clipped bam files are used, the users should set "softclippingAllowed = TRUE" for getTSS function.
getTSS(myTSSr)
TSS calling from bam files or retrieving TSS data from TSS table
myTSSr@TSSrawMatrix
# chr pos strand SL01 SL02 SL03 SL04
# 1: chrI 1561 + 0 0 0 1
# 2: chrI 5759 + 1 0 0 0
# 3: chrI 5765 + 1 0 0 0
# 4: chrI 5773 + 1 0 0 0
# 5: chrI 5925 + 0 1 0 0
# ---
# 163199: chrII 810860 - 0 0 0 2
# 163200: chrII 810963 - 0 1 0 0
# 163201: chrII 811112 - 0 0 1 0
# 163202: chrII 811370 - 1 0 0 0
# 163203: chrII 812101 - 1 0 0 0
TSSrawMatrix contains genomic coordinates and read counts of each called TSS. TSSrawMatrix can be exported by "exportTSStable" function to a TSS table "ALL.samples.TSS.raw.txt" in the working directory. The TSS table can be used as input file for future downstram analyses using TSSr or other tools. It is advised to export unmerged TSSrawMatrix to an TSS table to avoid repeated TSS calling step.
exportTSStable(myTSSr, data = "raw", merged = "FALSE")
The "plotCorrelation" function is used to calculate the pairwise correlation coefficients and plot pairwise scatter plots of TSS counts. A subset of samples can also be specified to display the pairwise correlations. Three correlation methods are supported: “pearson”, “kendall”, or “spearman”. The default setting generates scatter plots for all samples (samples = "all"). For plotting a subset of samples, users may provide a list of sample labels for "sample", e.g., samples = c("SL01","SL02","SL03").
plotCorrelation(myTSSr, samples = "all")
plotTssPCA(myTSSr, TSS.threshold=10)
mergeSamples(myTSSr)
myTSSr@TSSprocessedMatrix
# chr pos strand control treat
# 1: chrI 1561 + 0 1
# 2: chrI 5759 + 1 0
# 3: chrI 5765 + 1 0
# 4: chrI 5773 + 1 0
# 5: chrI 5925 + 1 0
# ---
#163199: chrII 810860 - 0 2
#163200: chrII 810963 - 1 0
#163201: chrII 811112 - 0 1
#163202: chrII 811370 - 1 0
#163203: chrII 812101 - 1 0
To return library sizes (number of total read counts) of merged samples in TSSr object in the specified order (note: order is specified in mergeSample function):
myTSSr@librarySizes
# control treat
# 3221609 5131202
Library sizes are usually different among samples. To provide between-sample comparability, the raw read counts of each TSS need to be scaled as tags per million mapped reads (TPM) with normalizeTSS function. TSSr also provides two options to exclude TSSs with a low support: "poisson" and "tpm". If you intend to remove TSSs with low support using the “poisson” option, you should skip the "normalizeTSS" because the "poisson" method (method = "poisson") use the raw read count data to calculate the probability of observing k numbers of reads supporting each TSS based on the sequencing depth of the sample per the Poisson distribution. Only TSSs with a significantly larger number of supporting reads than expected (default threshold p < 0.01) are considered as qualified TSSs. Non-significant TSSs are thus filtered by TSSr. TSSr will normalize the raw read counts as TPM after filtering by the "poisson" method.
filterTSS(myTSSr, method = "poisson")
If you use the "normalizeTSS" function, you may use the “filterTSS” option to remove TSSs with low support from mapped reads based on TMP value (method = "TPM"), and any TSS that has a lower TPM value than user-defined threshold "tpmLow" will be removed (default TPM threshold = 0.1).
normalizeTSS(myTSSr)
filterTSS(myTSSr, method = "TPM")
myTSSr@TSSprocessedMatrix
# chr pos strand control treat
# 1: chrI 1561 + 0.000000 0.194886
# 2: chrI 5759 + 0.310404 0.000000
# 3: chrI 5765 + 0.310404 0.000000
# 4: chrI 5773 + 0.310404 0.000000
# 5: chrI 5925 + 0.310404 0.000000
# ---
# 163199: chrII 810860 - 0.000000 0.389772
# 163200: chrII 810963 - 0.310404 0.000000
# 163201: chrII 811112 - 0.000000 0.194886
# 163202: chrII 811370 - 0.310404 0.000000
# 163203: chrII 812101 - 0.310404 0.000000
Simialr to the raw read counts, the normalized and filtered (processed) TSS matrix can be exported to delimited text file with "exportTSStable" function or bedGraph/BigWig files with "exportTSStoBedgraph" fucntion. bedGraph/BigWig files can be visualized in the UCSC Genome Browser or Integrative Genomics Viewer (IGV) or Genome Browser at YeasTSS.org for selected yeast species.
exportTSStable(myTSSr, data = "processed")
exportTSStoBedgraph(myTSSr, data = "processed", format = "bedGraph")
exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")
The “clusterTSS” function was designed to group neighboring TSSs into distinct TSS clusters (TCs), representing putative core promoters. It implements a TSS clustering algorithm based on peaking identification, namely “peakclu” (peak clustering) (Lu and Lin, 2021). Briefly, peakclu applies a sliding window approach (default window size = 100 bp with step size = 1) to scan TSS signals from the 5′ end of both strands of each chromosome. In each window, the TSS with the highest TPM value was identified as the peak. The surrounding TSSs are grouped with the peak into a TC. The clustering process of a TC terminates if a TSS is ≥ n bp (default n = 25) away from the nearest upstream TSS. In addition to setting a minimal allowed distance between peaks, TSSr offers another option to set maximal allowed extension distance between neighboring TSSs around peaks, which enables users to define the boundaries between neighboring core promoters. clusterTSS calculates inter-quantile width of a core promoter based on the cumulative distribution of TSS signals within the promoter. The positions of the 10th to 90th quantiles of TSS signals, which include at least 80% transcription initiation signals within a cluster, were defined as the 5’ and 3’ boundaries of the core promoter. The clustering step might be slow especially when the number of TSSs is in millions. Using multicores is highly recommended.
clusterTSS(myTSSr, method = "peakclu",peakDistance=100,extensionDistance=30
,localThreshold = 0.02, clusterThreshold = 1
,useMultiCore=FALSE, numCores=NULL)
myTSSr@tagClusters
# $control
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width
# 1: 1 chrI 6530 6564 + 6548 9.312118 3.724847 6530 6562 33
# 2: 2 chrI 7166 7189 + 7169 1.241616 0.620808 7166 7189 24
# 3: 3 chrI 8084 8087 + 8084 1.241616 0.620808 8084 8087 4
# 4: 4 chrI 9325 9411 + 9327 4.966464 1.241616 9327 9391 65
# 5: 5 chrI 9442 9479 + 9442 3.414443 1.862423 9442 9468 27
# ---
# 3303: 3303 chrII 804855 804975 - 804895 260.739280 71.703301 804868 804899 32
# 3304: 3304 chrII 807129 807179 - 807165 16.761812 6.518482 807161 807168 8
# 3305: 3305 chrII 808679 808723 - 808723 1.862424 0.931212 808679 808723 45
# 3306: 3306 chrII 809353 809377 - 809377 1.862424 1.552020 809353 809377 25
# 3307: 3307 chrII 809707 809748 - 809707 4.035251 2.172827 809707 809724 18
# $treat
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width
# 1: 1 chrI 6521 6564 + 6559 5.067037 1.753975 6530 6559 30
# 2: 2 chrI 7096 7118 + 7100 1.948860 0.584658 7096 7114 19
# 3: 3 chrI 8061 8087 + 8061 1.948860 0.584658 8061 8087 27
# 4: 4 chrI 9327 9476 + 9359 11.498278 1.753975 9327 9452 126
# 5: 5 chrI 11259 11343 + 11329 59.245376 21.827244 11288 11329 42
# ---
# 3244: 3244 chrII 804855 804952 - 804895 155.908888 54.568111 804872 804901 30
# 3245: 3245 chrII 807118 807179 - 807165 13.642026 5.456811 807153 807165 13
# 3246: 3246 chrII 808655 808691 - 808679 1.753974 0.779544 808655 808691 37
# 3247: 3247 chrII 809377 809389 - 809377 1.169316 0.779544 809377 809389 13
# 3248: 3248 chrII 809707 809710 - 809707 1.364203 0.974431 809707 809710 4
The results of TSS clustering can be exported to delimited text file with "exportClustersTable" function or bedGraph files with "exportClustersToBed" fucntion.
exportClustersTable(myTSSr, data = "tagClusters")
exportClustersToBed(myTSSr, data = "tagClusters")
TSSr infers a set of consensus core promoters using the “consensusCluster” function to assign the same ID for TCs belong to the same core promoter, which allows subsequent comparative studies across samples. TCs from different samples are considered to belong to the same consensus core promoter if the distance of their dominant TSSs is smaller than a user-defined distance (default = 50 bp). Similarly to clusterTSS function, consensusCluster function also returns genomic coordinates, sum of TSS tags, dominate TSS coordinate, a lower (q0.1) and an upper (q0.9) quantile coordinates, and interquantile widths for each consensus cluster in each sample.
consensusCluster(myTSSr, dis = 50, useMultiCore = FALSE, numCores = NULL)
Similarly, the detailed information of consensus clusters can be exported to delimited text files with "exportClustersTable" function or bedGraph files with "exportClustersToBed" fucntion.
exportClustersTable(myTSSr, data = "consensusClusters")
exportClustersToBed(myTSSr, data = "consensusClusters")
Core promoter shape reflects the distribution of TSS signals within a core promoter. TSSr provides three different options, inter-quantile width, promoter shape score (PSS) and shape index (SI), to quantify core promoter shape. Inter-quantile width refers to the distance between the locations of the 10th percentile to the 90th percentile TSS signals within a TSS cluster. Thus, it measures the width of a core promoter, but lacks the information of distribution patterns of TSS signals within a core promoter. Inter-quantile width could be significantly affected by different clustering methods. plotInterQuantile function plots interquantile width of each sample.
plotInterQuantile(myTSSr,samples = "all",tagsThreshold = 1)
Promoter shape score (PSS) integrates both inter quantile width and the observed probabilities of tags at every TSSs within a cluster (Lu and Lin 2019). PSS can be calculated using using shapeCluster function with method set as “PSS”. The smaller value represents the sharper core promoter. PSS is 0 representing singletons. SI is determined by the probabilities of tags at every TSSs within one cluster (Hoskins, Landolin et al. 2011). SI is also calculated using shapeCluster function with method set as “SI”. The greater value represents the sharper core promoter. The SI = 2 represents singletons.
shapeCluster(myTSSr,clusters = "consensusClusters", method = "PSS",useMultiCore= FALSE, numCores = NULL)
myTSSr@clusterShape
# $control
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width shape.score
# 1: 1 chrI 6530 6564 + 6548 9.312118 3.724847 6530 6562 33 9.646916
# 2: 3 chrI 7166 7189 + 7169 1.241616 0.620808 7166 7189 24 6.877444
# 3: 4 chrI 8084 8087 + 8084 1.241616 0.620808 8084 8087 4 2.000000
# 4: 5 chrI 9325 9411 + 9327 4.966464 1.241616 9327 9391 65 17.767262
# 5: 6 chrI 9442 9479 + 9442 3.414443 1.862423 9442 9468 27 7.469695
# ---
# 3302: 4285 chrII 804855 804975 - 804895 260.739280 71.703301 804868 804899 32 14.764972
# 3303: 4286 chrII 807129 807179 - 807165 16.761812 6.518482 807161 807168 8 5.251317
# 3304: 4287 chrII 808679 808723 - 808723 1.862424 0.931212 808679 808723 45 9.844044
# 3305: 4288 chrII 809353 809377 - 809377 1.862424 1.552020 809353 809377 25 3.018611
# 3306: 4289 chrII 809707 809748 - 809707 4.035251 2.172827 809707 809724 18 7.425270
#
# $treat
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width shape.score
# 1: 1 chrI 6521 6564 + 6559 5.067037 1.753975 6530 6559 30 13.123285
# 2: 2 chrI 7096 7118 + 7100 1.948860 0.584658 7096 7114 19 9.333375
# 3: 4 chrI 8061 8087 + 8061 1.948860 0.584658 8061 8087 27 9.012708
# 4: 5 chrI 9327 9476 + 9359 11.498278 1.753975 9327 9452 126 25.664108
# 5: 7 chrI 11259 11343 + 11329 59.245376 21.827244 11288 11329 42 15.020408
# ---
# 3240: 4285 chrII 804855 804952 - 804895 155.908888 54.568111 804872 804901 30 13.439327
# 3241: 4286 chrII 807118 807179 - 807165 13.642026 5.456811 807153 807165 13 7.308196
# 3242: 4287 chrII 808655 808691 - 808679 1.753974 0.779544 808655 808691 37 10.725295
# 3243: 4288 chrII 809377 809389 - 809377 1.169316 0.779544 809377 809389 13 3.398098
# 3244: 4289 chrII 809707 809710 - 809707 1.364203 0.974431 809707 809710 4 1.726241
The distributions of core promoter shape (PSS or SI) can be illustrated by a histograms with plotShape function. Please be noted that there is only one shapeCluster slot, so either PSS or SI can be saved in the shapeCluster slot depending on which method argument was used for shapeCluster.
plotShape(myTSSr)
The complete list of PSS can be exported to delimited text file with "exportShapeTable" function.
exportShapeTable(myTSSr)
Assigning TCs to downstream genes as their core promoters is required for annotation of the 5’ boundaries of genomic features. This process is also a prerequisite for further interrogations of regulated transcription initiation at the gene level. TSSr offers the “annotateCluster” function to assign TCs to their downstream genes. The assignment of a TC to a gene is based on the distance between the position of the dominant TSS of a TC and the annotated 5’ends of coding sequences (CDS) or transcripts. The default maximum distance between the dominant TSS and CDS is 1000 bp (“upstream = 1000”). If a TC overlaps with the CDS of an upstream gene, the dominant TSS of the TC must be within 500 bp to the 3’end of the overlapping CDS by default (“upstreamOverlap = 500”) to be eligible to assign to its downstream gene. If the 5’ends of annotated transcripts, instead of CDS, are used for TC assignment, users should set “annotationType” as “transcript,” and the default distance parameter is 500 bp. Because the genomes size and the number of introns vary substantially among organisms, it is necessary to apply customized criteria for TC assignment for different organisms. Users are advised to adjust the assignment criteria for core promoter assignment in TSSr. By default, only TCs with ≥ 0.02 TPM are used for the annotation process. To reduce transcriptional or technical noise of small clusters downstream a strong cluster, the filterCluster argument was set as "filterClusterThreshold = 0.02", indicating that any TC with TPM value < 0.02 TPM will be excluded from assigning to genes.
annotateCluster(myTSSr,clusters = "consensusClusters",filterCluster = TRUE,
filterClusterThreshold = 0.02, annotationType = "genes"
,upstream=1000, upstreamOverlap = 500, downstream = 0)
myTSSr@assignedClusters
# $control
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width gene inCoding
# 1: 5 chrI 9325 9411 + 9327 4.966464 1.241616 9327 9391 65 YAL066W <NA>
# 2: 6 chrI 9442 9479 + 9442 3.414443 1.862423 9442 9468 27 YAL066W <NA>
# 3: 7 chrI 11253 11343 + 11329 40.662913 17.693022 11275 11329 55 YAL064W-B <NA>
# 4: 33 chrI 31026 31151 + 31108 48.112608 17.072215 31108 31145 38 YAL062W <NA>
# 5: 34 chrI 31187 31263 + 31242 320.026426 122.609541 31212 31242 31 YAL062W <NA>
# ---
# 856: 4279 chrII 801003 801075 - 801023 11.174543 4.656059 801018 801061 44 YBR296C-A <NA>
# 857: 4284 chrII 804563 804593 - 804592 1.552020 0.931212 804563 804593 31 YBR298C <NA>
# 858: 4285 chrII 804855 804975 - 804895 260.739280 71.703301 804868 804899 32 YBR298C <NA>
# 859: 4288 chrII 809353 809377 - 809377 1.862424 1.552020 809353 809377 25 YBR300C <NA>
# 860: 4289 chrII 809707 809748 - 809707 4.035251 2.172827 809707 809724 18 YBR300C <NA>
#
# $treat
# cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width gene inCoding
# 1: 5 chrI 9327 9476 + 9359 11.498278 1.753975 9327 9452 126 YAL066W <NA>
# 2: 7 chrI 11259 11343 + 11329 59.245376 21.827244 11288 11329 42 YAL064W-B <NA>
# 3: 8 chrI 11907 11934 + 11919 2.338633 0.974431 11916 11922 7 YAL064W-B <NA>
# 4: 19 chrI 21237 21248 + 21246 1.948861 1.169317 21240 21246 7 YAL064W <NA>
# 5: 34 chrI 31087 31263 + 31242 117.711208 34.299955 31119 31242 124 YAL062W <NA>
# ---
# 810: 4271 chrII 799104 799125 - 799125 1.169316 0.584658 799104 799125 22 YBR296C <NA>
# 811: 4279 chrII 801017 801061 - 801023 6.821011 2.338633 801023 801058 36 YBR296C-A <NA>
# 812: 4285 chrII 804855 804952 - 804895 155.908888 54.568111 804872 804901 30 YBR298C <NA>
# 813: 4288 chrII 809377 809389 - 809377 1.169316 0.779544 809377 809389 13 YBR300C <NA>
# 814: 4289 chrII 809707 809710 - 809707 1.364203 0.974431 809707 809710 4 YBR300C <NA>
Instead of visualizing TSSs and core promoters in the UCSC Genome Browser or IGV, plotTSS function is able to generate publish ready figures when list of interested genes are provided and plotting region is specified.
The results of TC annotation can be exported to delimited text files with "exportClustersTable" function.
exportClustersTable(myTSSr, data = "assigned")
TSS data allow for the robust identification of enhancers by transcription of enhancer RNAs (eRNAs). Active enhancers produce bidirectional transcription of capped eRNAs, resulting in two diverging tag clusters by at most 400 bp. TSSr can identify this bidirectional cluster pairs and calculate a sample-set wide directionality score D for each locus (Andersson et al., 2014). D = (F-R)/(F+R), where F is the aggregated normalized tag counts in forward strandm and R is the aggregated normalized tag counts in reverse strand. Putative enhancers were then filtered with |D| < 0.8.
callEnhancer(myTSSr, flanking = 400)
The results of putative enhancers can be exported to delimited text files with "exportEnhancerTable" function.
exportEnhancerTable(myTSSr)
The number of tags at each TSS reflects the number of transcripts initiated at the TSS. Thus, TSS data can be used for expression profiling. With specified sample pairs for comparison, deGene function counts raw tags of each consensus clusters and utilizes the DESeq2 package (Love, Huber et al. 2014) for differential expression analysis.
deGene(myTSSr,comparePairs=list(c("control","treat")), pval = 0.01,useMultiCore=FALSE, numCores=NULL)
Differential expression analysis results can be visualized by plotDE function which generates a volcano plots. Names of genes differential expressed between the compared pairs are displayed on the dots when the withGeneName argument is set as TRUE.
plotDE(myTSSr, withGeneName = "TRUE",xlim=c(-2.5, 2.5),ylim=c(0,10))
Differential expression analysis results can also be exported to a text file with exportDETable function.
exportDETable(myTSSr, data = "sig")
One gene might have multiple core promoters which can be used differently in different samples. TSSr implements degree of shift (Ds) algorithm (Lu and Lin 2019) to quantify the degree of promoter shift across different samples.
shiftPromoter(myTSSr,comparePairs=list(c("control","treat")), pval = 0.01)
myTSSr@PromoterShift
# $control_VS_treat
# gene Ds pval padj
# 1: YBL017C 16.5567506300711 0 0.000000e+00
# 2: YBR006W -26.3346456736974 0 0.000000e+00
# 3: YBR023C 18.2077046044299 0 0.000000e+00
# 4: YBR039W -27.6613679948891 0 0.000000e+00
# 5: YBR121C -32.7177356221814 0 0.000000e+00
# ---
# 70: YBR158W 7.23779193641262 0.00110612692207393 4.424508e-03
# 71: YBR140C -6.96937554342683 0.00119456280847422 4.710952e-03
# 72: YBR157C 0.867544334037256 0.00122552060110212 4.765913e-03
# 73: YBR143C -1.42890967687923 0.00163014425826738 6.252608e-03
# 74: YBL060W -1.471483572077 0.00197980632089202 7.491159e-03
# 75: YBR053C 2.26685314089749 0.00259863406742416 9.701567e-03
# gene Ds pval padj
Below is an example of core promoter shift in genes YBR168W and YBL067C. The two major core promoters are differently used in control and treat samples.
plotTSS(myTSSr,samples=c("control","treat"),tssData = "processed"
,clusters = "assigned"
,clusterThreshold = 0.02
,genelist=c("YBR168W","YBL067C")
,up.dis =500,down.dis = 100,yFixed=TRUE)
Results of core promoter shift analysis can also be exported to a text file with exportShiftTable function.
exportShiftTable(myTSSr)
Zhenguo Lin, PhD
Department of Biology, Saint Louis University, USA.
Email: zhenguo.lin@slu.edu
Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., Chen, Y., Zhao, X., Schmidl, C., Suzuki, T., Ntini, E., Arner, E., Valen, E., Li, K., Schwarzfischer, L., Glatz, D., Raithel, J., Lilje, B., Rapin, N., Bagger, F. O., … Sandelin, A. (2014). "An atlas of active enhancers across human cell types and tissues". Nature 507(7493), 455–461.
Arribere, J. A. and W. V. Gilbert (2013). "Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing." Genome Res 23(6): 977-987.
Cumbie, J. S., M. G. Ivanchenko and M. Megraw (2015). "NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites." BMC Genomics 16: 597.
Cvetesic, N., H. G. Leitch, M. Borkowska, F. Muller, P. Carninci, P. Hajkova and B. Lenhard (2018). "SLIC-CAGE: high-resolution transcription start site mapping using nanogram-levels of total RNA." Genome Res 28(12): 1943-1956.
Kruesi, W. S., L. J. Core, C. T. Waters, J. T. Lis and B. J. Meyer (2013). "Condensin controls recruitment of RNA polymerase II to achieve nematode X-chromosome dosage compensation." Elife 2: e00808.
Mahat, D. B., H. Kwak, G. T. Booth, I. H. Jonkers, C. G. Danko, R. K. Patel, C. T. Waters, K. Munson, L. J. Core and J. T. Lis (2016). "Base-pair-resolution genome-wide mapping of active RNA polymerases using precision nuclear run-on (PRO-seq)." Nat Protoc 11(8): 1455-1476.
Mahat, D. B., H. Kwak, G. T. Booth, I. H. Jonkers, C. G. Danko, R. K. Patel, C. T. Waters, K. Munson, L. J. Core and J. T. Lis (2016). "Base-pair-resolution genome-wide mapping of active RNA polymerases using precision nuclear run-on (PRO-seq)." Nat Protoc 11(8): 1455-1476.
Malabat, C., F. Feuerbach, L. Ma, C. Saveanu and A. Jacquier (2015). "Quality control of transcription start site selection by nonsense-mediated-mRNA decay." Elife 4.
Murata, M., H. Nishiyori-Sueki, M. Kojima-Ishiyama, P. Carninci, Y. Hayashizaki and M. Itoh (2014). "Detecting expressed genes using CAGE." Methods Mol Biol 1164: 67-85.
Pelechano, V., W. Wei and L. M. Steinmetz (2013). "Extensive transcriptional heterogeneity revealed by isoform profiling." Nature 497(7447): 127-131.
Takahashi, H., T. Lassmann, M. Murata and P. Carninci (2012). "5' end-centered expression profiling using cap-analysis gene expression and next-generation sequencing." Nat Protoc 7(3): 542-561.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.