In CostaLab/scMEGA: Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

knitr::opts_chunk$set(echo = TRUE)

During this tutorial, we will integrate the snRNA-seq and snATAC-seq data generated from the human heart samples after myocardial infarction. The integrated data will be used as input for inferring gene regulatory network.

We first download the required data. In this case, we need two Seurat objects with each one corresponding to snRNA-seq and snATAC-seq respectively. The snRNA-seq object includes gene expression data of all fibroblasts and the snATAC-seq includes all chromatin accessibility profiles. Additionally, we also need a gene activity matrix for data integration. This matrix was estimated from the snATAC-seq data by using the ArchR package. The script of cleaning the data and preparing these objects is found here.

Run the following commands to download the data:

if(!dir.exists('./Fibroblast')){
    dir.create('./Fibroblast')
}

download.file(url = 'https://costalab.ukaachen.de/open_data/scMEGA/Fibroblast/snRNA.rds', 
              destfile = './Fibroblast/snRNA.rds', 
              method = 'wget', extra = '--no-check-certificate')

download.file(url = 'https://costalab.ukaachen.de/open_data/scMEGA/Fibroblast/snATAC.rds', 
              destfile = './Fibroblast/snATAC.rds', 
              method = 'wget', extra = '--no-check-certificate')

download.file(url = 'https://costalab.ukaachen.de/open_data/scMEGA/Fibroblast/gene.activity.rds', 
              destfile = './Fibroblast/gene.activity.rds', 
              method = 'wget', extra = '--no-check-certificate')

Next, we load all necessary packages:

suppressMessages(library(ArchR))
suppressMessages(library(Seurat))
suppressMessages(library(Signac))
suppressMessages(library(scMEGA))
suppressMessages(library(harmony))
suppressMessages(library(Nebulosa))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(JASPAR2020))
suppressMessages(library(TFBSTools))
suppressMessages(library(igraph))
suppressMessages(library(ggraph))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))

options(future.globals.maxSize = 1e9)

Data integration

Let's load the data into memory and see how they look like

obj.rna <- readRDS("./Fibroblast/snRNA.rds")
obj.atac <- readRDS("./Fibroblast/snATAC.rds")
gene.activity <- readRDS("./Fibroblast/gene.activity.rds")

We need to convert the assays to an Assay5 for Seuratv5

obj.rna[["RNA"]] <- as(obj.rna[["RNA"]], "Assay5")
obj.atac[["ATAC"]] <- as(obj.atac[["ATAC"]], "Assay5")

obj.rna
obj.atac

We can observe that there are 45,515 and 6,481 cells in our snRNA-seq and snATAC-seq datasets. Let's now visualize the data as colored by patients. Note that here we used the UMAP embedding generated from batch-corrected low-dimensional space so that no batch effects are observed from the 2D visualization.

p1 <- DimPlot(obj.rna, pt.size = 1, reduction = "umap_harmony") +
    ggtitle("snRNA-seq")

p2 <- DimPlot(obj.atac, pt.size = 1, reduction = "umap_harmony") +
    ggtitle("snATAC-seq")

p1 + p2

Co-embedding

First, we need to project the data into a common low-dimensional space. This is done by using the CCA method from Seurat. To this end, we have wrapped several functions from Seurat into a single function CoembedData.

obj.coembed <- CoembedData(
  obj.rna,
  obj.atac, 
  gene.activity, 
  weight.reduction = "harmony", 
  verbose = FALSE
)

We next visualize the snRNA-seq and snATAC-seq in this shared UMAP space. The cells are colored by patients or modalities.

p1 <- DimPlot(obj.coembed, group.by = "patient", shuffle = TRUE, label = TRUE)
p2 <- DimPlot(obj.coembed, group.by = "tech", shuffle = TRUE, label = TRUE)

p1 + p2

The batch effects between patients, regions and modalities are quite clear. So next we use Harmony to perform batch correction and generate a new UMAP embedding.

obj.coembed <- RunHarmony(obj.coembed, 
                          group.by.vars = "patient",
                          reduction.use = "pca", 
                          dims.use = 1:30,
                          project.dim = FALSE,
                          plot_convergence = FALSE)

obj.coembed <- RunUMAP(
  obj.coembed,
  dims = 1:30,
  reduction = 'harmony',
  reduction.name = "umap_harmony",
  reduction.key = 'umapharmony_',
  verbose = FALSE
)

We can plot the data again

p1 <-
  DimPlot(obj.coembed, group.by = "patient", reduction = "umap_harmony")
p2 <-
  DimPlot(obj.coembed, group.by = "tech", reduction = "umap_harmony")

p1 + p2

From the new UMAP embedding, we can observe that after batch-correction, cells from different patients, regions, and modalities are well mixed.

Based on our previous works of myofibroblast differentiation in human and mouse kidney, we already known some relevant genes for this biological process. For example, SCARA5 is a marker for myofibroblast progenitor, and COL1A1, POSTN, and FN1 are highly expressed in myofibroblast. Therefore we can visualize the expression of these genes to check if we can also identify similar process in human heart. Note that to make the visualization clear, here we used the package Nebulosa to plot the data.

p1 <-
  plot_density(obj.coembed,
               features = "SCARA5",
               reduction = "umap_harmony",
               pal = "magma")
p2 <-
  plot_density(obj.coembed,
               features = "COL1A1",
               reduction = "umap_harmony",
               pal = "magma")
p3 <-
  plot_density(obj.coembed,
               features = "POSTN",
               reduction = "umap_harmony",
               pal = "magma")
p4 <-
  plot_density(obj.coembed,
               features = "FN1",
               reduction = "umap_harmony",
               pal = "magma")


(p1 + p2) / (p3 + p4)

From the visualization, we can observe that some cells highly express SCARA5 which could be the progenitors of myofibroblasts. On the other hand, some cells highly express COL1A1, POSTN, and FN1 and they could be terminally differentiated myofibroblasts.

Sub-clustering

We next perform sub-clustering to identify different populations in our multi-omic fibroblast data. To further control the data quality, here we will use a two-round approach to remove low-quality cells. We first use a high-resolution to get a large amount of clusters.

obj.coembed <- FindNeighbors(obj.coembed, reduction = "harmony", dims = 1:30)
obj.coembed <- FindClusters(obj.coembed, resolution = 1, verbose = FALSE)

cols <- ArchR::paletteDiscrete(obj.coembed@meta.data[, "RNA_snn_res.1"])

p <- DimPlot(obj.coembed, group.by = "RNA_snn_res.1", label = TRUE,
             reduction = "umap_harmony", shuffle = TRUE) +
    scale_color_manual(values = cols) +
    xlab("UMAP1") + ylab("UMAP2")

p

We can use the function CellPropPlot to visualize the cell propotion across all patients.

p <- CellPropPlot(obj.coembed,
                  group.by = "RNA_snn_res.1",
                  prop.in = "patient_region_id",
                  cols = cols)

p

Since we have annotated all patients into three major groups, i.e., myogenic, ischmeic, and fibrotic. we can also perform statistical test to check if any sub-population are enriched in any of the above group. This can be done by the function CompareCellProp.

The idea is to identify sub-populations that show significant difference.

obj.coembed$patient_group <- factor(obj.coembed$patient_group, 
                                    levels = c("myogenic", "ischemic", "fibrotic"))

p <- CompareCellProp(object = obj.coembed, 
                     group.by = "RNA_snn_res.1", 
                     prop.in = "patient_region_id", 
                      sample.annotation = "patient_group",
                    comparisons = list(c("myogenic", "ischemic"),
                                       c("ischemic", "fibrotic"),
                                       c("myogenic", "fibrotic")))
p

We subset the object to only keep the clusters with significant difference and re-do the clustering

Idents(obj.coembed) <- "RNA_snn_res.1"
coembed.sub <- subset(obj.coembed, idents = c(0, 1, 11, 12, 3, 4, 7, 8, 9))
coembed.sub

# re-do the UMAP
coembed.sub <- RunUMAP(coembed.sub, 
                       dims = 1:30, 
                       reduction = 'harmony',
                       reduction.name = "umap_harmony",
                       reduction.key = 'umap_harmony_',
                       verbose = FALSE)


## re-clustering with a lower resolution
coembed.sub <- FindNeighbors(coembed.sub, reduction = "harmony", dims = 1:30)

coembed.sub <- FindClusters(coembed.sub, resolution = 0.2, verbose = FALSE)

cols <- ArchR::paletteDiscrete(coembed.sub@meta.data[, "RNA_snn_res.0.2"])

p <- DimPlot(coembed.sub, group.by = "RNA_snn_res.0.2", label = TRUE,
             reduction = "umap_harmony", shuffle = TRUE) +
    scale_color_manual(values = cols) +
    xlab("UMAP1") + ylab("UMAP2")

p

Visualize the patient and RNA/ATAC distribution

p1 <- DimPlot(coembed.sub, group.by = "patient", reduction = "umap_harmony", shuffle = TRUE, label = TRUE)
p2 <- DimPlot(coembed.sub, group.by = "tech", reduction = "umap_harmony", shuffle = TRUE, label = TRUE)

p1 + p2

Next, we identify the markers for each cluster and visualize the top 10.

all.markers <- FindAllMarkers(coembed.sub, 
                              only.pos = TRUE, 
                              min.pct = 0.5, 
                              logfc.threshold = 0.5)

df <- all.markers %>%
    group_by(cluster) %>%
    slice_max(n = 10, order_by = avg_log2FC)

p <- DotPlot(coembed.sub, features = unique(df$gene)) + RotatedAxis()

print(p)

The above dot plot demonstrates the top 10 markers per cluster and we can easily classify cluster 1 as myofibroblasts.

Let's compare the cell proportion again:

coembed.sub$patient_group <- factor(coembed.sub$patient_group, 
                                    levels = c("myogenic", "ischemic", "fibrotic"))

p <- CompareCellProp(object = coembed.sub, 
                     group.by = "RNA_snn_res.0.2", 
                     prop.in = "patient_region_id", 
                     sample.annotation = "patient_group",
                     comparisons = list(c("myogenic", "ischemic"),
                                       c("ischemic", "fibrotic"),
                                       c("myogenic", "fibrotic")))

p

Save data

saveRDS(coembed.sub, "./Fibroblast/coembed.sub.rds")

Trajectory analysis

We next identify the trajectory for myofibroblast differentiation.

Dimensionality reduction

To infer trajectory, we will perform dimension reduction using diffusion map via the function RunDiffusionMap. This is based on the R package destiny.

coembed.sub <- RunDiffusionMap(coembed.sub, reduction = "harmony")

cols <- ArchR::paletteDiscrete(coembed.sub@meta.data[, "RNA_snn_res.0.2"])

p <- DimPlot(coembed.sub, group.by = "RNA_snn_res.0.2", label = TRUE,
             reduction = "dm", shuffle = TRUE, cols = cols) +
    xlab("DC 1") + ylab("DC 2")

p

We can also plot snATAC-seq and snRNA-seq individually

DimPlot(coembed.sub, reduction = "dm", 
        group.by = "RNA_snn_res.0.2", split.by = "tech", cols = cols)

Cell pairing

Next, we match the cells between these two modalities. In other words, for each cell in, for example, snATAC-seq, we will find a cell from snRNA-seq data so that these two cells have the similar profiles. This is only necessary when each modality was performed independently. If snRNA-seq and snATAC-seq data was generated by multi-modal protocol, e.g., 10X multiome or SHARE-seq, this step can be skipped.

We here use the method proposed by Kartha, Vinay K., et al. to match the cells.

df.pair <- PairCells(object = coembed.sub, reduction = "harmony",
                     pair.mode = "greedy",
                     pair.by = "tech", ident1 = "ATAC", ident2 = "RNA")

We can visualize the paired cells

sel_cells <- c(df.pair$ATAC, df.pair$RNA)
coembed.sub2 <- coembed.sub[, sel_cells]

options(repr.plot.height = 5, repr.plot.width = 10)
DimPlot(coembed.sub2, reduction = "dm", 
        group.by = "RNA_snn_res.0.2", split.by = "tech", cols = cols)

We next create a new Seurat object for there paired cells as if they are generated by single-cell multimodal protocol.

obj.pair <- CreatePairedObject(df.pair = df.pair, 
                               obj.coembed = coembed.sub2,
                               obj.rna = obj.rna,
                               obj.atac = obj.atac,
                               rna.assay = "RNA", 
                               atac.assay = "ATAC")

obj.pair <- NormalizeData(obj.pair)

Finally, we infer a pseudo-time trajectory from SCARA5+ fibroblasts to myofibroblast using the approach from ArchR. Here we modified the function to allow to take a Seurat object as input

obj.pair <- AddTrajectory(object = obj.pair, 
                          trajectory = c(0, 2, 1),
                          group.by = "RNA_snn_res.0.2", 
                          reduction = "dm",
                          dims = 1:3, 
                          use.all = FALSE)

# we only plot the cells that are in this trajectory
obj <- obj.pair[, !is.na(obj.pair$Trajectory)]

p <- TrajectoryPlot(object = obj, 
                    reduction = "dm",
                    continuousSet = "blueYellow",
                    size = 1,
                   addArrow = FALSE)

p

TF and gene selection

We next select candidate TFs and genes for building a meaningful gene regulatory network.

Select TFs

To identify potential regulator (i.e., TFs), we first estimate an acitivty score for each TF in each cell. This is done by first performing motif matching and then computing deviation scores using chromVAR.

# Get a list of motif position frequency matrices from the JASPAR database
pfm <- getMatrixSet(
  x = JASPAR2020,
  opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE)
)

# add motif information
obj <- AddMotifs(
  object = obj,
  genome = BSgenome.Hsapiens.UCSC.hg38,
  pfm = pfm,
    assay = "ATAC"
)

obj <- RunChromVAR(
  object = obj,
  genome = BSgenome.Hsapiens.UCSC.hg38,
    assay = "ATAC"
)

res <- SelectTFs(object = obj, return.heatmap = TRUE)

df.cor <- res$tfs
ht <- res$heatmap

We can visualize the TF activity dynamic along the trajectory

draw(ht)

Select genes

We will select relevant genes by first linking genes to peaks based on the corrleation between gene expression from the snRNA-seq data and peak accessibility from the snATAC-seq data along the inferred trajectory. This means that we only consider a gene to be a potential target if it can be assocaited to at least one peak.

res <- SelectGenes(object = obj,
                  labelTop1 = 0,
                  labelTop2 = 0)

df.p2g <- res$p2g
ht <- res$heatmap

draw(ht)

Gene regulatory network inference

We here will try to predict a gene regulatory network based on the correlation of the TF binding activity as estimated from snATAC-seq and gene expression as measured by snRNA-seq along the trajectory.

tf.gene.cor <- GetTFGeneCorrelation(object = obj, 
                                    tf.use = df.cor$tfs, 
                                    gene.use = unique(df.p2g$gene),
                                    tf.assay = "chromvar", 
                                    gene.assay = "RNA",
                                    trajectory.name = "Trajectory")

We can then visualize this correlation matrix by heatmap. Also, we can cluster the genes and TFs to identify different regulatory modules for the predefined sub-populations.

ht <- GRNHeatmap(tf.gene.cor, 
                 tf.timepoint = df.cor$time_point)
ht

To associate genes to TFs, we will use the peak-to-gene links and TF binding sites information. Specifically, if a gene is regulated by a peak and this peak is bound by a TF, then we consider this gene to be a target of this TF.

motif.matching <- obj@assays$ATAC@motifs@data
colnames(motif.matching) <- obj@assays$ATAC@motifs@motif.names
motif.matching <-
    motif.matching[unique(df.p2g$peak), unique(tf.gene.cor$tf)]


df.grn <- GetGRN(motif.matching = motif.matching, 
                 df.cor = tf.gene.cor, 
                 df.p2g = df.p2g)

Next, we can visualize our network as the last step of this analysis

# define colors for nodes representing TFs (i.e., regulators)
df.cor <- df.cor[order(df.cor$time_point), ]
tfs.timepoint <- df.cor$time_point
names(tfs.timepoint) <- df.cor$tfs

# plot the graph, here we can highlight some genes
df.grn2 <- df.grn %>%
    subset(correlation > 0.4) %>%
    select(c(tf, gene, correlation)) %>%
    rename(weights = correlation)

saveRDS(df.grn2,"./Fibroblast/final_grn.Rds")    

p <- GRNPlot(df.grn2, 
             tfs.timepoint = tfs.timepoint,
             show.tf.labels = TRUE,
             seed = 42, 
             plot.importance = FALSE,
            min.importance = 2,
            remove.isolated = FALSE)

options(repr.plot.height = 20, repr.plot.width = 20)

print(p)

GRN visualization

GRN along the trajectory

Once we generated the gene regulatory network, we can visualize individual TFs in terms of binding activity, expression, and target expression along the pseudotime trajectory.

Here we select two TFs for visualization.

obj <- AddTargetAssay(object = obj, df.grn = df.grn2)

p1 <- PseudotimePlot(object = obj, tf.use = "NR3C2")
p2 <- PseudotimePlot(object = obj, tf.use = "RUNX1")

p1 + p2

The x-axis in above plots present pseudotime point along the trajectory, and the y-axis represent TF binding acitivty, TF expression, and TF target expression after z-score transformation.

Save data

saveRDS(obj, "./Fibroblast/coembed.trajectory.rds")

# Check session information
sessionInfo()

CostaLab/scMEGA documentation built on Sept. 25, 2024, 6:11 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CostaLab/scMEGA
Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

In CostaLab/scMEGA: Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

Data integration

Co-embedding

Sub-clustering

Trajectory analysis

Dimensionality reduction

Cell pairing

TF and gene selection

Select TFs

Select genes

Gene regulatory network inference

GRN visualization

GRN along the trajectory

R Package Documentation

Browse R Packages

We want your feedback!

CostaLab/scMEGA Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

In CostaLab/scMEGA: Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

Data integration

Co-embedding

Sub-clustering

Trajectory analysis

Dimensionality reduction

Cell pairing

TF and gene selection

Select TFs

Select genes

Gene regulatory network inference

GRN visualization

GRN along the trajectory

R Package Documentation

Browse R Packages

We want your feedback!

CostaLab/scMEGA
Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference