In timoast/signac: Analysis of Single-Cell Chromatin Data

library(Signac)
library(Seurat)
library(ggplot2)
library(patchwork)
library(EnsDb.Hsapiens.v75)

Here, we take a look at two different datasets containing both DNA accessibility measurements and mitochondrial mutation data in the same cells. One was sampled from a patient with a colorectal cancer (CRC) tumor, and the other is from a polyclonal TF1 cell line. This data was produced by Lareau and Ludwig et al. (2020), and you can read the original paper here: https://doi.org/10.1038/s41587-020-0645-6.

Processed data files, including mitochondrial variant data for the CRC and TF1 dataset is available on Zenodo here: https://zenodo.org/record/3977808

Raw sequencing data and DNA accessibility processed files for the these datasets are available on NCBI GEO here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142745

View data download code

The required files can be downloaded by running the following lines in a shell:

```{bash eval=FALSE}

ATAC data

wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.filtered_peak_bc_matrix.h5 wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.singlecell.csv wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.fragments.tsv.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.fragments.tsv.gz.tbi

mitochondrial allele data

wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.A.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.C.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.G.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.T.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.depthTable.txt wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.chrM_refAllele.txt

</details>


# Colorectal cancer dataset

To demonstrate combined analyses of mitochondrial DNA variants and accessible
chromatin, we'll walk through a vignette analyzing cells from a primary
colorectal adenocarcinoma. The sample contains a mixture of malignant
epithelial cells and tumor infiltrating immune cells. 

## Loading the DNA accessibility data

First we load the scATAC-seq data and create a Seurat object following the 
standard workflow for scATAC-seq data.

```r
# load counts and metadata from cellranger-atac
counts <- Read10X_h5(filename = "../vignette_data/mito/CRC_v12-mtMask_mgatk.filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "../vignette_data/mito/CRC_v12-mtMask_mgatk.singlecell.csv",
  header = TRUE,
  row.names = 1
)

# load gene annotations from Ensembl
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"

# create object
crc_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  annotation = annotations,
  min.cells = 10,
  fragments = '../vignette_data/mito/CRC_v12-mtMask_mgatk.fragments.tsv.gz'
)
crc <- CreateSeuratObject(
  counts = crc_assay,
  assay = 'peaks',
  meta.data = metadata
)

crc[["peaks"]]

Quality control

We can compute the standard quality control metrics for scATAC-seq and filter out low-quality cells based on these metrics.

# Augment QC metrics that were computed by cellranger-atac
crc$pct_reads_in_peaks <- crc$peak_region_fragments / crc$passed_filters * 100
crc$pct_reads_in_DNase <- crc$DNase_sensitive_region_fragments / crc$passed_filters * 100
crc$blacklist_ratio <- crc$blacklist_region_fragments / crc$peak_region_fragments

# compute TSS enrichment score and nucleosome banding pattern
crc <- TSSEnrichment(crc)
crc <- NucleosomeSignal(crc)

# visualize QC metrics for each cell
VlnPlot(crc, c("TSS.enrichment", "nCount_peaks", "nucleosome_signal", "pct_reads_in_peaks", "pct_reads_in_DNase", "blacklist_ratio"), pt.size = 0, ncol = 3)

# remove low-quality cells
crc <- subset(
  x = crc,
  subset = nCount_peaks > 1000 &
    nCount_peaks < 50000 &
    pct_reads_in_DNase > 40 &
    blacklist_ratio < 0.05 &
    TSS.enrichment > 3 & 
    nucleosome_signal < 4
)
crc

Loading the mitochondrial variant data

Next we can load the mitochondrial DNA variant data for these cells that was quantified using mgatk. The ReadMGATK() function in Signac allows the output from mgatk to be read directly into R in a convenient format for downstream analysis with Signac. Here, we load the data and add it to the Seurat object as a new assay.

# load mgatk output
mito.data <- ReadMGATK(dir = "../vignette_data/mito/crc/")

# create an assay
mito <- CreateAssayObject(counts = mito.data$counts)

# Subset to cell present in the scATAC-seq assat
mito <- subset(mito, cells = Cells(crc))

# add assay and metadata to the seurat object
crc[["mito"]] <- mito
crc <- AddMetaData(crc, metadata = mito.data$depth[Cells(mito), ], col.name = "mtDNA_depth")

We can look at the mitochondrial sequencing depth for each cell, and further subset the cells based on mitochondrial sequencing depth.

VlnPlot(crc, "mtDNA_depth", pt.size = 0.1) + scale_y_log10()

# filter cells based on mitochondrial depth
crc <- subset(crc, mtDNA_depth >= 10)
crc

Dimension reduction and clustering

Next we can run a standard dimension reduction and clustering workflow using the scATAC-seq data to identify cell clusters.

crc <- RunTFIDF(crc)
crc <- FindTopFeatures(crc, min.cutoff = 10)
crc <- RunSVD(crc)
crc <- RunUMAP(crc, reduction = "lsi", dims = 2:50)
crc <- FindNeighbors(crc, reduction = "lsi", dims = 2:50)
crc <- FindClusters(crc, resolution = 0.7, algorithm = 3)

DimPlot(crc, label = TRUE) + NoLegend()

Generate gene scores

To help interpret these clusters of cells, and assign a cell type label, we'll estimate gene activities by summing the DNA accessibility in the gene body and promoter region.

# compute gene accessibility
gene.activities <- GeneActivity(crc)

# add to the Seurat object as a new assay
crc[['RNA']] <- CreateAssayObject(counts = gene.activities)

crc <- NormalizeData(
  object = crc,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(crc$nCount_RNA)
)

Visualize interesting gene activity scores

We note the following markers for different cell types in the CRC dataset:

EPCAM is a marker for epithelial cells
TREM1 is a meyloid marker
PTPRC = CD45 is a pan-immune cell marker
IL1RL1 is a basophil marker
GATA3 is a Tcell maker

DefaultAssay(crc) <- 'RNA'

FeaturePlot(
  object = crc,
  features = c('TREM1', 'EPCAM', "PTPRC", "IL1RL1","GATA3", "KIT"),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 2
)

Using these gene score values, we can assign cluster identities:

crc <- RenameIdents(
  object = crc,
  '0' = 'Epithelial',
  '1' = 'Epithelial',
  '2' = 'Basophil',
  '3' = 'Myeloid_1',
  '4' = 'Myeloid_2',
  '5' = 'Tcell'
)

One of the myeloid clusters has a lower percentage of fragments in peaks, as well as a lower overall mitochondrial sequencing depth and a different nucleosome banding pattern.

p1 <- FeatureScatter(crc, "mtDNA_depth", "pct_reads_in_peaks") + ggtitle("") + scale_x_log10()
p2 <- FeatureScatter(crc, "mtDNA_depth", "nucleosome_signal") + ggtitle("") + scale_x_log10()

p1 + p2 + plot_layout(guides = 'collect')

We can see that most of the low FRIP cells were the myeloid 1 cluster. This is most likely an intra-tumor granulocyte that has relatively poor accessible chromatin enrichment. Similarly, the unusual nuclear chromatin packaging of this cell type yields slightly reduced mtDNA coverage compared to the myeloid 2 cluster.

Find informative mtDNA variants

Next, we can identify sites in the mitochondrial genome that vary across cells, and cluster the cells into clonotypes based on the frequency of these variants in the cells. Signac utilizes the principles established in the original mtscATAC-seq work of identifying high-quality variants.

variable.sites <- IdentifyVariants(crc, assay = "mito", refallele = mito.data$refallele)
VariantPlot(variants = variable.sites)

The plot above clearly shows a group of variants with a higher VMR and strand concordance. In principle, a high strand concordance reduces the likelihood of the allele frequency being driven by sequencing error (which predominately occurs on one but not the other strand. This is due to the preceding nucleotide content and a common error in mtDNA genotyping). On the other hand, variants that have a high VMR are more likely to be clonal variants as the alternate alleles tend to aggregate in certain cells rather than be equivalently dispersed about all cells, which would be indicative of some other artifact.

We note that variants that have a very low VMR and and very high strand concordance are homoplasmic variants for this sample. While these may be interesting in some settings (e.g. donor demultiplexing), for inferring subclones, these are not particularly useful.

Based on these thresholds, we can filter out a set of informative mitochondrial variants that differ across the cells.

# Establish a filtered data frame of variants based on this processing
high.conf <- subset(
  variable.sites, subset = n_cells_conf_detected >= 5 &
    strand_correlation >= 0.65 &
    vmr > 0.01
)

high.conf[,c(1,2,5)]

A few things stand out. First, 10 out of the 12 variants occur at less than 1% allele frequency in the population. However, 16147C>T is present at about 62%. We'll see that this is a clonal variant marking the epithelial cells. Additionally, all of the called variants are transitions (A - G or C - T) rather than transversion mutations (A - T or C - G). This fits what we know about how these mutations arise in the mitochondrial genome.

Depending on your analytical question, these thresholds can be adjusted to identify variants that are more prevalent in other cells.

Compute the variant allele frequency for each cell

We currently have information for each strand stored in the mito assay to allow strand concordance to be assessed. Now that we have our set of high-confidence informative variants, we can create a new assay containing strand-collapsed allele frequency counts for each cell for these variants using the AlleleFreq() function.

crc <- AlleleFreq(
  object = crc,
  variants = high.conf$variant,
  assay = "mito"
)
crc[["alleles"]]

Visualize the variants

Now that the allele frequencies are stored as an additional assay, we can use the standard functions in Seurat to visualize how these allele frequencies are distributed across the cells. Here we visualize a subset of the variants using FeaturePlot() and DoHeatmap().

DefaultAssay(crc) <- "alleles"
alleles.view <- c("12889G>A", "16147C>T", "9728C>T", "9804G>A")
FeaturePlot(
  object = crc,
  features = alleles.view,
  order = TRUE,
  cols = c("grey", "darkred"),
  ncol = 4
) & NoLegend()

DoHeatmap(crc, features = rownames(crc), slot = "data", disp.max = 1) +
  scale_fill_viridis_c()

Here, we can see a few interesting patterns for the selected variants. 16147C>T is present in essentially all epithelial cells and almost exclusively in epithelial cells (the edge cases where this isn't true are also cases where the UMAP and clustering don't full agree). It is at 100% allele frequency-- strongly suggestive of whatever cell of origin of this tumor had the mutation at 100% and then expanded. We then see at least 3 variants 1227G>A, 12889G>A, and 9728C>T that are mostly present specifically in the epithelial cells that define subclones. Other variants including 3244G>A, 9804G>A, and 824T>C are found specifically in immune cell populations, suggesting that these arose from a common hematopoetic progenitor cell (probably in the bone marrow).

TF1 cell line dataset

Next we'll demonstrate a similar workflow to identify cell clones in a different dataset, this time generated from a TF1 cell line. This dataset contains more clones present at a higher proportion, based on the experimental design.

We'll demonstrate how to identify groups of related cells (clones) by clustering the allele frequency data and how to relate these clonal groups to accessibility differences utilizing the multimodal capabilities of Signac.

Data loading

View data download code

To download the data from Zenodo run the following in a shell:

```{bash eval=FALSE}

ATAC data

wget https://zenodo.org/record/3977808/files/TF1.filtered.fragments.tsv.gz wget https://zenodo.org/record/3977808/files/TF1.filtered.fragments.tsv.gz.tbi wget https://zenodo.org/record/3977808/files/TF1.filtered.narrowPeak.gz

mitochondrial genome data

wget https://zenodo.org/record/3977808/files/TF1_filtered.A.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.T.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.C.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.G.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.chrM_refAllele.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.depthTable.txt.gz

</details>

```r
# read the mitochondrial data
tf1.data <- ReadMGATK(dir = "../vignette_data/mito/tf1/")

# create a Seurat object
tf1 <- CreateSeuratObject(
  counts = tf1.data$counts,
  meta.data = tf1.data$depth,
  assay = "mito"
)

# load the peak set
peaks <- read.table(
  file = "../vignette_data/mito/TF1.filtered.narrowPeak.gz",
  sep = "\t",
  col.names = c("chrom", "start", "end", "peak", "width", "strand", "x", "y", "z", "w")
)
peaks <- makeGRangesFromDataFrame(peaks)

# create fragment object
frags <- CreateFragmentObject(
  path = "../vignette_data/mito/TF1.filtered.fragments.tsv.gz",
  cells = colnames(tf1)
)

# quantify the DNA accessibility data
counts <- FeatureMatrix(
  fragments = frags,
  features = peaks,
  cells = colnames(tf1)
)

# create assay with accessibility data and add it to the Seurat object
tf1[["peaks"]] <- CreateChromatinAssay(
  counts = counts,
  fragments = frags
)

Quality control

# add annotations
Annotation(tf1[["peaks"]]) <- annotations

DefaultAssay(tf1) <- "peaks"

tf1 <- NucleosomeSignal(tf1)
tf1 <- TSSEnrichment(tf1)

VlnPlot(tf1, c("nCount_peaks", "nucleosome_signal", "TSS.enrichment"), pt.size = 0.1)

tf1 <- subset(
  x = tf1,
  subset = nCount_peaks > 500 &
    nucleosome_signal < 2 &
    TSS.enrichment > 2.5
)
tf1

Identifying variants

DefaultAssay(tf1) <- "mito"
variants <- IdentifyVariants(tf1, refallele = tf1.data$refallele)
VariantPlot(variants)

high.conf <- subset(
  variants, subset = n_cells_conf_detected >= 5 &
    strand_correlation >= 0.65 &
    vmr > 0.01
)

tf1 <- AlleleFreq(tf1, variants = high.conf$variant, assay = "mito")
tf1[["alleles"]]

Identifying clones

Now that we've identified a set of variable alleles, we can cluster the cells based on the frequency of each of these alleles using the FindClonotypes() function. This uses the Louvain community detection algorithm implemented in Seurat.

DefaultAssay(tf1) <- "alleles"
tf1 <- FindClonotypes(tf1)

table(Idents(tf1))

Here we see that the clonal clustering has identified 12 different clones in the TF1 dataset. We can further visualize the frequency of alleles in these clones using DoHeatmap(). The FindClonotypes() function also performs hierarchical clustering on both the clonotypes and the alleles, and sets the factor levels for the clonotypes based on the hierarchical clustering order, and the order of variable features based on the hierarchical feature clustering. This allows us to get a decent ordering of both features and clones automatically:

DoHeatmap(tf1, features = VariableFeatures(tf1), slot = "data", disp.max = 0.1) +
  scale_fill_viridis_c()

Find differentially accessible peaks between clones

Next we can use the clonal information derived from the mitochondrial assay to find peaks that are differentially accessible between clones.

DefaultAssay(tf1) <- "peaks"

# find peaks specific to one clone
markers.fast <- FoldChange(tf1, ident.1 = 2)
markers.fast <- markers.fast[order(markers.fast$avg_log2FC, decreasing = TRUE), ] # sort by fold change
head(markers.fast)

We can the DNA accessibility in these regions for each clone using the CoveragePlot() function. As you can see, the peaks identified are highly specific to one clone.

CoveragePlot(
  object = tf1,
  region = rownames(markers.fast)[1],
  extend.upstream = 2000,
  extend.downstream = 2000
)

Session Info

sessionInfo()

timoast/signac documentation built on April 5, 2024, 1:34 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

timoast/signac
Analysis of Single-Cell Chromatin Data

In timoast/signac: Analysis of Single-Cell Chromatin Data

ATAC data

mitochondrial allele data

Quality control

Loading the mitochondrial variant data

Dimension reduction and clustering

Generate gene scores

Visualize interesting gene activity scores

Find informative mtDNA variants

Compute the variant allele frequency for each cell

Visualize the variants

TF1 cell line dataset

Data loading

ATAC data

mitochondrial genome data

Quality control

Identifying variants

Identifying clones

Find differentially accessible peaks between clones

R Package Documentation

Browse R Packages

We want your feedback!

timoast/signac Analysis of Single-Cell Chromatin Data

In timoast/signac: Analysis of Single-Cell Chromatin Data

ATAC data

mitochondrial allele data

Quality control

Loading the mitochondrial variant data

Dimension reduction and clustering

Generate gene scores

Visualize interesting gene activity scores

Find informative mtDNA variants

Compute the variant allele frequency for each cell

Visualize the variants

TF1 cell line dataset

Data loading

ATAC data

mitochondrial genome data

Quality control

Identifying variants

Identifying clones

Find differentially accessible peaks between clones

R Package Documentation

Browse R Packages

We want your feedback!

timoast/signac
Analysis of Single-Cell Chromatin Data