In aertslab/cisTopic: cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data

Vignette built on r format(Sys.time(), "%b %d, %Y") with cisTopic version r packageVersion("cisTopic").

Installation

R < 3.5

If your R version is below 3.5, you will need to install manually the following packages:

devtools::install_github("aertslab/AUCell")
devtools::install_github("aertslab/RcisTarget")

source("https://bioconductor.org/biocLite.R")
biocLite('GenomicRanges')

cisTopic package

For installing cisTopic run:

devtools::install_github("aertslab/cisTopic")

Vignette packages

In this vignette, you will require additional packages:

source("https://bioconductor.org/biocLite.R")
biocLite(c('Rsubread', 'umap', 'Rtsne', 'ComplexHeatmap', 'fastcluster', 'data.table', 'rGREAT', 'ChIPseeker', 'TxDb.Hsapiens.UCSC.hg19.knownGene', 'org.Hs.eg.db', 'densityClust'))

What is cisTopic?

cisTopic is an R/Bioconductor package for the simulataneous identification of cis-regulatory topics and cell states from single cell epigenomics data. cisTopic relies on an algorithm called Latent Dirichlet Allocation (LDA), a robust Bayesian method used in text mining to group documents addressing similar topics and related words into topics. Interestingly, this model has a series of assumptions that are fulfilled in single-cell epigenomics data, such as non-ordered features ('bag of words') and the allowance of overlapping topics (i.e. a regulatory region can be co-accessible with different other regions depending on the context, namely, the cell type or state).

cisTopic now uses the WarpLDA implementation for topic modelling (Chen et al, 2016), where each region in each cell is assigned to a topic based on (1) to which topic the region is assigned in other cells and (2) to which topics the regions are assigned in that cell. In comparison to Collapsed Gibbs Smpling, both counts are fixed after all tokens are sampled, allowing to randomly access regions in cells for assignment; while with a CGS we need to update the counts after each assignment before making the next assignment. After a number of iterations through the data set, these assignments are used to estimate the probability of a region belonging to a cis-regulatory topic (region-topic distribution) and the contributions of a topic within each cell (topic-cell distribution). These distributions can in turn be used to cluster cells and identify cell types, and to analyse the regulatory sequences in each topic.

cisTopic consists of 4 main steps: (1) generation of a binary accessibility matrix as input for LDA; (2) LDA and model selection; (3) cell state identification using the topic-cell distributions from LDA and (4) exploration of the region-topic distributions.

If you do not want to run some of the steps in the tutorial, you can load the precomputed cisTopic object:

cisTopicObject <- readRDS('WarpLDA_cisTopicObject_pbmc.Rds')

Running cisTopic

Input data

Some steps in this tutorial might take a few minutes to run, as reference we mention the running time for this dataset and settings in our system. Your actual running time will depend on your computer and dataset. In this tutorial, we will run cisTopic on 5,335 Peripheral blood mononuclear cells (PBMCs) from a healthy donor, with 97k potential regulatory regions.

First, load cisTopic:

suppressWarnings(library(cisTopic))

The cisTopic object can be initialized for 10X data can either from (a) the count matrix produced by CellRanger ATAC; or (b) CellRanger ATAC fragments file and a bed file with the candidate regulatory regions. If the user prefers to define peaks in a different way from CellRanger ATAC, we recommend to use option b.

The cisTopic object contains all the information and outputs from the analysis. For more information, run:

?`cisTopic-class`

For initializing the cisTopic object:

a. Starting from the CellRanger ATAC count matrix [Reference running time (5k cells, 97k regions): 1,5 min]

pathTo10X <- '/10x_example/'
data_folder <- paste0(pathTo10X, 'filtered_peak_bc_matrix')
metrics <- paste0(pathTo10X, 'atac_v1_pbmc_5k_singlecell.csv')
cisTopicObject <- createcisTopicObjectFrom10Xmatrix(data_folder, metrics,  project.name='5kPBMCs')

b. Starting from the CellRanger ATAC fragments file and a bed file with potential rgeulatory regions. For more information regards the fragments file, please visit: https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/fragments. In this case, we will use as regulatory regions the ones provided by CellRanger ATAC and the function will count which fragments overlap with each region in each cell [Reference running time (5k cells, 97k regions): 20 min]

pathTo10X <- '10x_example/'
fragments <- paste0(pathTo10X, 'atac_v1_pbmc_5k_fragments.tsv.gz')
regions <-  paste0(pathTo10X, 'atac_v1_pbmc_5k_peaks.bed')
metrics <- paste0(pathTo10X, 'atac_v1_pbmc_5k_singlecell.csv')
cisTopicObject <- createcisTopicObjectFrom10X(fragments, regions, metrics, project.name='5kPBMCs')

By default, the slots @cell.data and @region.data will be initialized with the number of reads and accessible regions (depending on the selected threshold) per cell and region, respectively, and the metrics from CellRanger ATAC. Extra metadata can be added using the functions addCellMetadata (e.g. phenotype information) and addRegionMetadata. For example, we can add the graph-based clusters from CellRanger ATAC for comparison.

pathTo10X <- '/10x_example/'
pathTographBasedClusters_CRA <- paste0(pathTo10X, 'analysis/clustering/graphclust/clusters.csv') 
graphBasedClusters_CRA <- read.table(pathTographBasedClusters_CRA, sep=',', header=TRUE, row.names = 1)
colnames(graphBasedClusters_CRA) <- 'graphBasedClusters_CRA'
graphBasedClusters_CRA[,1] <- as.factor(graphBasedClusters_CRA[,1])
cisTopicObject <- addCellMetadata(cisTopicObject, graphBasedClusters_CRA)

Building the models

The next step in the cisTopic workflow is to use Latent Dirichlet Allocation (LDA) for the modelling of cis-regulatory topics. LDA allows to derive, from the original high-dimensional and sparse data, (1) the probability distributions over the topics for each cell in the data set and (2) the probability distributions over the regions for each topic (Blei et al., 2003). These distributions indicate, respectively, how important a regulatory topic is for a cell, and how important regions are for the regulatory topic. Here, we use WarpLDA (Chen et al, 2016), in which we assign regions to a certain topic by randomly sampling from a distribution where the probability of a region being assigned to a topic is proportional to the contributions of that region to the topic and the contributions of that topic in a cell, with delayed count update.

To do this, runWarpLDAModels() builds several models (e.g. with diferent numbers of topics) using Latent Dirichlet Allocation (LDA) on the binary accessibility matrix (automatically stored in the initialized cisTopicObject). We can then select the best model using selectModel().

The main parameters for running the models (runWarpLDAModels) are:

Number of topics (topic): The number of topics are usually slightly bigger than the potential cell states in the data set. In the case of single cell epigenomics data the number of topics is low compared to other implementations (e.g. text classification). The running time will be affected by the number of topics.
The Dirichlet hyperparameters alpha (topic proportions) and beta (topic multinomials): Alpha affects to the topic-cell contributions; a low alpha forces to pick for each cell a few topics with significant contribution, while a high alpha allows cells to have similar, smooth topic proportions. Beta affects to the region combinations; the lower the beta, the fewer regions a topic will have; the higher the beta, the less distinct these topics will be (i.e. there will be more overlap between the topics). By default, we select alpha as 50/number of topics and beta as 0.1 (as Griffiths & Steyvers, 2004).

In this tutorial, we will test models with 2, 5, 10 to 25, 30, 35, 40 topics [Reference running time for the example data set (5k cells; ~97k regions): 40 minutes].

cisTopicObject <- runWarpLDAModels(cisTopicObject, topic=c(2, 5, 10:25, 30, 35, 40), seed=987, nCores=20, iterations = 500, addModels=FALSE)

NOTE: For large data sets, we recommend to use the tmp parameter, that allows to save models in the specified folder as finished. If using these models for filling a cisTopicObject later on, you will need to fill the parameters in cisTopicObject@calc.params[['runWarpLDAModels']] or cisTopicObject@calc.params[['runCGSModels']] manually:

# For WarpLDA
cisTopicObject@calc.params[['runWarpLDAModels']]$seed <- seed
cisTopicObject@calc.params[['runWarpLDAModels']]$iterations <- iterations
cisTopicObject@calc.params[['runWarpLDAModels']]$alpha <- alpha
cisTopicObject@calc.params[['runWarpLDAModels']]$alphaByTopic <- alphaByTopic
cisTopicObject@calc.params[['runWarpLDAModels']]$beta <- beta

# For CGS
cisTopicObject@calc.params[['runCGSModels']]$seed <- seed
cisTopicObject@calc.params[['runCGSModels']]$burnin <- burnin
cisTopicObject@calc.params[['runCGSModels']]$iterations <- iterations
cisTopicObject@calc.params[['runCGSModels']]$alpha <- alpha
cisTopicObject@calc.params[['runCGSModels']]$alphaByTopic <- alphaByTopic
cisTopicObject@calc.params[['runCGSModels']]$beta <- beta

Selection of the best model

In this version of cisTopic, we incorporate three different approaches to select the best topic:

The log likelihood can be used to estimate the plausibility of a model parameter value, given the observed data (i.e. the highest the likelihood, the better the model). This is not recommended when running WarpLDA, as the curves stabilize rather than reaching a maximum.
The second derivative in each point of the likelihood curve, as it measures the changed in the curvature from point to point (i.e. the highest the second derivative means that the next model is not improving much more the log-likelihood) . This is not recommended when running CGS, as the curves tend to be noisier or reach a maximum. This is the default in this version of cisTopic.
The perplexity of each model (only for WarpLDA models) measures how well the model predicts the sample. The lower the perplexity is, the better the model.

Alternatively, you can also manually select the number of topics with the parameter select.

par(mfrow=c(3,3))
cisTopicObject <- selectModel(cisTopicObject, type='maximum')
cisTopicObject <- selectModel(cisTopicObject, type='perplexity')
cisTopicObject <- selectModel(cisTopicObject, type='derivative')

In this case we select the model based on the second derivative, with 25 topics.

Interpreting the models

A. Identification of cell states using the cell-cisTopic distributions

LDA returns two distributions that represent (1) the topic contributions per cell and (2) the region contribution to a topic. We can interpret these values as a dimensinality reduction method, after which the data is re-represented as a matrix with cells as columns, topics as rows and contributions as values. The recorded topic assignments to the cells (not normalised) are stored in cisTopicObject@selected.model$document_expects (see lda package).

Different methods can be used for clustering and/or visualization. cisTopic includes wrapper functions to easily run Umap, tSNE, diffussion maps and PCA (the results are saved in the slot @dr$cell):

cisTopicObject <- runtSNE(cisTopicObject, target='cell', seed=123, pca=FALSE, method='Probability')
cisTopicObject <- runUmap(cisTopicObject, target='cell', seed=123, method='Probability')

Additionally, we can use the cell-topic matrix to define clusters. The user can select the preferred methodology to do this. In this case, we will apply a peak density algorithm on the tsne dimensionality reduction projections, as in Cusanovich et al. (2018).

If you want to retrieve the normalised assignments for other analyses, you can use the function modelMatSelection:

cellassign <- modelMatSelection(cisTopicObject, 'cell', 'Probability')

set.seed(123)
library(Rtsne)
DR <- Rtsne(t(cellassign), pca=F)
DRdist <- dist(DR$Y)
library(densityClust)
dclust <- densityClust(DRdist,gaussian=T)
dclust <- findClusters(dclust, rho = 50, delta = 2.5)

# Check thresholds
options(repr.plot.width=6, repr.plot.height=6)
plot(dclust$rho,dclust$delta,pch=20,cex=0.6,xlab='rho', ylab='delta')
points(dclust$rho[dclust$peaks],dclust$delta[dclust$peaks],col="red",pch=20,cex=0.8)
text(dclust$rho[dclust$peaks]-2,dclust$delta[dclust$peaks]+1.5,labels=dclust$clusters[dclust$peaks])
abline(v=50)
abline(h=2.5)

# Add cluster information
densityClust <- dclust$clusters
densityClust <- as.data.frame(densityClust)
rownames(densityClust) <- cisTopicObject@cell.names
colnames(densityClust) <- 'densityClust'
densityClust[,1] <- as.factor(densityClust[,1])
cisTopicObject <- addCellMetadata(cisTopicObject, densityClust)

Once calculations are done, cisTopic offers a unified visualization function (plotFeatures), which allows to visualize tSNE, Umap, diffussion maps, principal components and biplots (in 2/3D), colored by metadata and/or topic enrichment.

par(mfrow=c(1,2))
plotFeatures(cisTopicObject, method='Umap', target='cell', topic_contr=NULL, colorBy=c('nCounts', 'nAcc','densityClust', 'graphBasedClusters_CRA'), cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, col.low='darkgreen', col.mid='yellow', col.high='brown1', intervals=10)

We can also generate a heatmap based on the cell-cisTopic distributions.

cellTopicHeatmap(cisTopicObject, method='Z-score', colorBy=c('densityClust'), col.low = "dodgerblue", col.mid = "floralwhite", col.high = "brown1")

To color the Umap by topic score:

par(mfrow=c(2,5))
plotFeatures(cisTopicObject, method='Umap', target='cell', topic_contr='Probability', colorBy=NULL, cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE)

Enrichment of epigenomic signatures in the cells

By multiplying the cell and topic assignments, the likelihood of each region in each cell (i.e. predictive distribution). This matrix is stored in object@predictive.distribution. These distributions can be used to estimate drop-outs and build cell-specific region rankings that can be used with AUCell for estimating the enrichment of epigenomic signatures within the cells.

pred.matrix <- predictiveDistribution(cisTopicObject)

For example, we can evaluate which cells are more enriched for certain ChIP-seq signatures. First, epigenomic regions are intersected and mapped to regions in the dataset (by default, with at least 40% overlap). To test the enrichment of these signatures in each cell, we use a GSEA-like recovery curve ranking-based approach. In each cell, regions are ranked based on their probability (x-axis), and when a region is present in the signature we increase one step in the y-axis. The Area Under the Curve (AUC) is used to evaluate the importance of that signature within that cell. The corresponding overlapping sets (which are stored in object@signatures) are used as input, together with the cell-specific region rankings, for the function signatureCellEnrichment. AUC values for each specific signature are stored in object@cell.data. In this case, we can use bulk signatures from the hematopoietic system from Corces et al. (2016).

# Obtain signatures
path_to_signatures <- paste0(pathTo10X, 'Bulk_peaks/')
Bulk_ATAC_signatures <- paste(path_to_signatures, list.files(path_to_signatures), sep='')
labels  <- gsub('._peaks.narrowPeak', '', list.files(path_to_signatures))
cisTopicObject <- getSignaturesRegions(cisTopicObject, Bulk_ATAC_signatures, labels=labels, minOverlap = 0.4)

# To only keep unique peaks per signature
cisTopicObject@signatures <- llply(1:length(cisTopicObject@signatures), function (i) cisTopicObject@signatures[[i]][-which(cisTopicObject@signatures[[i]] %in% unlist(as.vector(cisTopicObject@signatures[-i])))]) 
names(cisTopicObject@signatures) <- labels

# Compute cell rankings
library(AUCell)
aucellRankings <- AUCell_buildRankings(pred.matrix, plot=FALSE, verbose=FALSE)

# Check signature enrichment in cells (Reference time: 1 min)
cisTopicObject <- signatureCellEnrichment(cisTopicObject, aucellRankings, selected.signatures='all', aucMaxRank = 0.3*nrow(aucellRankings), plot=FALSE)

# Plot
par(mfrow=c(2,2))
plotFeatures(cisTopicObject, method='Umap', target='cell', topic_contr=NULL, colorBy=c('CD4Tcell', 'Mono', 'Bcell', 'NKcell'), cex.legend = 0.4, factor.max=.75, dim=2, legend=TRUE, intervals=10)

NOTE: The predictive distributions and the AUCell rankings are not stored in the cisTopic object as they have a big size.

B. Analysis of the regulatory topics

Defining topics

To analyze the regions included in the cisTopics, the first step is always to derive a score that evaluates how likely is for a region to belong to a topic. getRegionsScores() calculates these scores based on the proportion of region specific assignments to a topic. These scores can be rescaled into the range [0,1], which will be useful for the binarization step (as it will force data to follow a gamma distribution shape). This information is stored in the region.data slot.

cisTopicObject <- getRegionsScores(cisTopicObject, method='NormTop', scale=TRUE)

BigWig files for observing the scored regions in the genome can be generated. Note that information on the length of the chromosomes has to be provided. These files can be uploaded in IGV or UCSC for visualisation. This information can be easily found in the TxDb objects of the corresponding genomes, for example.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

getBigwigFiles(cisTopicObject, path='output/cisTopics_asBW', seqlengths=seqlengths(txdb))

However, many tools are limited to work with sets of regions rather than rankings of regions. Keywords, or the most contributing regions in a topic, can be used as a representative set of regions of the topic. binarizecisTopics() allows to select the top regions based on two methods:

a) method = "Predefined": to select a predefined number of regions (determined by the cutoffs argument)

b) method = "GammaFit" (default): to automatically select a threshold based on a fit of the scores to a gamma distribution. This is recommended when using method="NormTop" and scale=TRUE in getRegionScores(). Note that the probability threshold must be provided by the user and it must be taken after the density (based on the fitted gamma distribution) is stabilised (i.e. in the tail of the distribution).

par(mfrow=c(2,5))
cisTopicObject <- binarizecisTopics(cisTopicObject, thrP=0.975, plot=TRUE)

The regions sets selected and distributions for each cisTopic can then be analized in different ways (examples in next sections). They can also be exported to bed files to analyze with external tools:

getBedFiles(cisTopicObject, path='output/cisTopics_asBed')

Topic visualization

Based on the topic scores for each region, different methods can be used for clustering and/or visualization (as shown for cells). cisTopic includes wrapper functions to easily run Umap, tSNE, diffussion maps and PCA (the results are saved in the slot @dr$region). In the case of regions, only high confidence regions (i.e. that pass the binarization threshold at least in 1 topic) are used:

cisTopicObject <- runtSNE(cisTopicObject, target='region', perplexity=200, check_duplicates=FALSE)

The function plotFeatures can also be used to visualize region-based tSNEs, diffussion maps, principal components and biplots (in 2/3D), colored by metadata and/or topic enrichment.

plotFeatures(cisTopicObject, method='tSNE', target='region', topic_contr=NULL, colorBy=c('nCells'), cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, col.low='darkgreen', col.mid='yellow', col.high='brown1', intervals=10)

par(mfrow=c(2,5))
plotFeatures(cisTopicObject, method='tSNE', target='region', topic_contr='Z-score', colorBy=NULL, cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, col.low='darkgreen', col.mid='yellow', col.high='brown1')

Enrichment of epigenomic signatures

Another way of exploring the topics is to check their overlap (i.e. the regions included in the topics) with predefined epigenomic signatures/datasets. For example, regions from ChIP-seq can point towards enrichment of binding sites of a given TF, regions from FAIRE- or ATAC-seq to regions generally open in a given cell type or tissue, etc.

First, epigenomic regions are intersected and mapped to regions in the dataset (by default, with at least 40% overlap). To test the enrichment of these signatures in each cell, we use a GSEA-like recovery curve ranking-based approach. In each topic, regions are ranked based on their topic probability (x-axis), and when a region is present in the signature we increase one step in the y-axis. The Area Under the Curve (AUC) is used to evaluate the importance of that signature within that topic. Signatures are also saved in object@region.data as logical columns (TRUE if the region is in the signature, otherwise FALSE). A heatmap showing the enrichment scores for each topic and signature can be obtained with signaturesHeatmap.

# Obtain signatures (if it has not been run before)
path_to_signatures <- paste0(pathTo10X, 'Bulk_peaks/')
Bulk_ATAC_signatures <- paste(path_to_signatures, list.files(path_to_signatures), sep='')
labels  <- gsub('._peaks.narrowPeak', '', list.files(path_to_signatures))
cisTopicObject <- getSignaturesRegions(cisTopicObject, Bulk_ATAC_signatures, labels=labels, minOverlap = 0.4)

# To only keep unique peaks per signature
cisTopicObject@signatures <- llply(1:length(cisTopicObject@signatures), function (i) cisTopicObject@signatures[[i]][-which(cisTopicObject@signatures[[i]] %in% unlist(as.vector(cisTopicObject@signatures[-i])))]) 
names(cisTopicObject@signatures) <- labels

We can visualize how these regions are enriched within each topic with signaturesHeatmap. With this function, we obtain a heatmap showing the row normalised AUC scores.

signaturesHeatmap(cisTopicObject)

Annotation to genes and GO terms

Another way of gaining insight on the topics is to link the regions to genes, and to determine GO terms (or pathways or any other type of gene-set) that are enriched within them. cisTopic provides the function annotateRegions() to annotate regions to GO terms using the "TxDb" Bioconductor packages (replace 'TxDb.Hsapiens.UCSC.hg19.knownGene' by the appropiate organism package), and annotation databases ("OrgDb" packages).

library(org.Hs.eg.db)
cisTopicObject <- annotateRegions(cisTopicObject, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene, annoDb='org.Hs.eg.db')

As we saw before, we can use the region type annotations as region sets/signatures to check whether a topic is more enriched in a certain type of region. We see that regions that belong to topics that are not enriched for any of the cell type specific signatures are promoter topics. Additionally, the set of regions that are accessible in a higher number of cells are also promoters.

par(mfrow=c(1,1))
signaturesHeatmap(cisTopicObject, selected.signatures = 'annotation')
plotFeatures(cisTopicObject, method='tSNE', target='region', topic_contr=NULL, colorBy=c('annotation'), cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, intervals=20)

For identifying enriched GO terms per topic, cisTopic provides a wrapper over rGREAT (Gu Z, 2017) [Reference running time: 20 minutes]. The binarised topics (i.e. sets of top regions per topic) are used in this step. Results are stored in object@binarized.rGREAT.

date()
cisTopicObject <- GREAT(cisTopicObject, genome='hg19', fold_enrichment=2, geneHits=1, sign=0.05, request_interval=10)
date()

We can visualize the enrichment results:

ontologyDotPlot(cisTopicObject, top=5, topics=c(6,10), var.y='name', order.by='Binom_Adjp_BH')

(Transcription factor) motif enrichment

It is also possible to identify enriched motifs within the topics and form cistromes (i.e. sets of sequences enriched for a given motif). To do this, we use RcisTarget (Aibar et al., 2017). The current version provides databases for human (hg19). You can find the region-based database at: https://resources.aertslab.org/cistarget/

For this analysis, we first need to convert the cisTopic regions to the regions in the databases ("ctx regions"). We can do this in two ways:

a) Binarised, converting the binarised topic to a set of equivalent ctx regions (a region can map to more than one ctx region, and all regions which overlap more than the given threshold are taken).

cisTopicObject <- binarizedcisTopicsToCtx(cisTopicObject, genome='hg19')

b) Based on the maximum overlap. This is useful if we need to use the scores (a region is mapped to its most overlapping ctx region). This information is stored in object@region.data.

cisTopicObject <- scoredRegionsToCtx(cisTopicObject, genome='hg19')

We are now ready to run RcisTarget in each topic using the wrapper function topicsRcisTarget(). This function uses the binarised topic regions converted to ctx regions.

pathToFeather <- "hg19-regions-9species.all_regions.mc8nr.feather"
cisTopicObject <- topicsRcisTarget(cisTopicObject, genome='hg19', pathToFeather, reduced_database=FALSE, nesThreshold=3, rocthr=0.005, maxRank=20000, nCores=5)

Once RcisTarget is run, interactive motif enrichment tables can be explored (e.g. per topic):

Topic6_motif_enr <- cisTopicObject@binarized.RcisTarget[[6]]
DT::datatable(Topic6_motif_enr[,-c("enrichedRegions", "TF_lowConf"), with=FALSE], escape = FALSE, filter="top", options=list(pageLength=5))

Topic10_motif_enr <- cisTopicObject@binarized.RcisTarget[[10]]
DT::datatable(Topic10_motif_enr[,-c("enrichedRegions", "TF_lowConf"), with=FALSE], escape = FALSE, filter="top", options=list(pageLength=5))

We find SPI1 and CEBPB as master regulators, for example, of the B cell and the monocyte topic, respectively.

Formation of cistromes

RcisTarget results can be used to form cistromes. We define a cistrome as a set of sequences enriched for motifs linked to a certain transcription factor. In the case of cisTopic, we build topic-specific cistromes. cisTopic produces 3 different types of cistromes: ctx-regions based, original-regions based and gene based (based on region annotation). The annotation parameter decides whether only motifs linked with high confidence should be used or also motifs indirectly annotated should be considered (i.e. in this case, the *_extended cistromes will contain both annotations).

cisTopicObject <- getCistromes(cisTopicObject, annotation = 'Both', nCores=5)

Cistromes are useful to compare regions linked to a TF which have different spatio-temporal patterns, which may be caused i.e. by the presence of co-factors or different concentrations of the TF. Importantly, we can also estimate and visualize the enrichment of these topic specific cistromes in the cells (as shown above). For example, below we show the different enrichment pattern of cell type-specific SPI1 regions.

# Compute AUC rankings based on the predictive distribution
pred.matrix <- predictiveDistribution(cisTopicObject)

library(AUCell)
aucellRankings <- AUCell_buildRankings(pred.matrix, plot=FALSE, verbose=FALSE)

cisTopicObject <- getCistromeEnrichment(cisTopicObject, topic=6, TFname='SPI1', aucellRankings = aucellRankings, aucMaxRank = 0.05*nrow(aucellRankings), plot=FALSE)
cisTopicObject <- getCistromeEnrichment(cisTopicObject, topic=10, TFname='SPI1', aucellRankings = aucellRankings, aucMaxRank = 0.05*nrow(aucellRankings), plot=FALSE)

par(mfrow=c(1,2))
plotFeatures(cisTopicObject, method='Umap', target='cell', topic_contr=NULL, colorBy=c('Topic6_SPI1 (751p)','Topic10_SPI1 (587p)'), cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, intervals=10)

Differential motif enrichment can be performed using RSAT or Homer (Medina-Rivera et al, 2015; Heinz et al, 2017); and shape features can be modelled per sequence using GBshape bigwig files (Chiu et al., 2015). These features can be used as input to Machine Learning methods (i.e. Random Forest) to determine their relevance in generating the different patterns.

Gene accessibility scores

We can also evaluate the overall accessibility around certain markers using the predictive distribution. In this case, we will sum the probability of each region linked to each marker gene (based on ChIPseeker annotations in this tutorial).

region2gene <- cisTopicObject@region.data[,'SYMBOL', drop=FALSE]
region2gene <- split(region2gene, region2gene[,'SYMBOL']) 
region2gene <- lapply(region2gene, rownames) 

# From pretrained Garnett's model markers on pBMC dataset from 10X (Pliner et al, 2019)
selectedGenes <- c('CD34', 'THY1', 'ENG', 'KIT', 'PROM1', #CD34+
                   'NCAM1', 'FCGR3A', #NKcells
                   'CD14', 'FCGR1A', 'CD68', 'S100A12', #Monocytes
                   'CD19', 'MS4A1', 'CD79A', #Bcells
                   'CD3D', 'CD3E', 'CD3G', #Tcells
                   'CD4', 'FOXP3', 'IL2RA', 'IL7R', #CD4 Tcell
                   'CD8A', 'CD8B', #CD8 Tcell
                   'IL3RA', 'CD1C', 'BATF3', 'THBD', 'CD209' #Dendritic cells
                   )
region2gene_subset <- region2gene[which(names(region2gene) %in% selectedGenes)]
predMatSumByGene <- sapply(region2gene_subset, function(x) apply(pred.matrix[x,, drop=F], 2, sum))
rownames(predMatSumByGene) <- cisTopicObject@cell.names
# Add to cell data
cisTopicObject <- addCellMetadata(cisTopicObject, predMatSumByGene)

par(mfrow=c(1,2))
plotFeatures(cisTopicObject, method='Umap', target='cell', topic_contr=NULL, colorBy=c('KIT', 'PROM1', 'FCGR3A','CD14','S100A12', 'MS4A1', 'CD79A', 'CD3D', 'CD3E', 'CD4', 'CD8A'), cex.legend = 0.8, factor.max=.75, dim=2, legend=TRUE, intervals=10)

Finally, you can save your cisTopic object:

saveRDS(cisTopicObject, file='WarpLDA_cisTopicObject_pbmc.Rds')

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. Handbook of latent semantic analysis, 427(7), 424-440.
Chen, J., Li, K., Zhu, J., & Chen, W. (2016). Warplda: a cache efficient o (1) algorithm for latent dirichlet allocation. Proceedings of the VLDB Endowment, 9(10), 744-755.
Cusanovich, D. A., Reddington, J. P., Garfield, D. A., Daza, R. M., Aghamirzaie, D., Marco-Ferreres, R., ... & Trapnell, C. (2018). The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature, 555(7697), 538.
Aibar, S., Bravo González-Blas, C., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., ... & Atak, Z. K. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083.
Medina-Rivera, A., Defrance, M., Sand, O., Herrmann, C., Castro-Mondragon, J. A., Delerce, J., ... & Staines, D. M. (2015). RSAT 2015: regulatory sequence analysis tools. Nucleic acids research, 43(W1), W50-W56.
Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., ... & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular cell, 38(4), 576-589.
Chiu, T. P., Yang, L., Zhou, T., Main, B. J., Parker, S. C., Nuzhdin, S. V., ... & Rohs, R. (2014). GBshape: a genome browser database for DNA shape annotations. Nucleic acids research, 43(D1), D103-D109.
Pliner, H. A., Shendure, J., & Trapnell, C. (2019). Supervised classification enables rapid annotation of cell atlases. BioRxiv, 538652.

SessionInfo

sessionInfo()

aertslab/cisTopic documentation built on April 6, 2024, 9:31 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

aertslab/cisTopic
cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data

In aertslab/cisTopic: cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data

Installation

R < 3.5

cisTopic package

Vignette packages

What is cisTopic?

Running cisTopic

Input data

Building the models

Selection of the best model

Interpreting the models

A. Identification of cell states using the cell-cisTopic distributions

Enrichment of epigenomic signatures in the cells

B. Analysis of the regulatory topics

Defining topics

Topic visualization

Enrichment of epigenomic signatures

Annotation to genes and GO terms

(Transcription factor) motif enrichment

Formation of cistromes

Gene accessibility scores

References

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

aertslab/cisTopic cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data

In aertslab/cisTopic: cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data

Installation

R < 3.5

cisTopic package

Vignette packages

What is cisTopic?

Running cisTopic

Input data

Building the models

Selection of the best model

Interpreting the models

A. Identification of cell states using the cell-cisTopic distributions

Enrichment of epigenomic signatures in the cells

B. Analysis of the regulatory topics

Defining topics

Topic visualization

Enrichment of epigenomic signatures

Annotation to genes and GO terms

(Transcription factor) motif enrichment

Formation of cistromes

Gene accessibility scores

References

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

aertslab/cisTopic
cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data