library(Signac) library(Seurat) library(ggplot2) library(patchwork) library(EnsDb.Hsapiens.v75)
Here, we take a look at two different datasets containing both DNA accessibility measurements and mitochondrial mutation data in the same cells. One was sampled from a patient with a colorectal cancer (CRC) tumor, and the other is from a polyclonal TF1 cell line. This data was produced by Lareau and Ludwig et al. (2020), and you can read the original paper here: https://doi.org/10.1038/s41587-020-0645-6.
Processed data files, including mitochondrial variant data for the CRC and TF1 dataset is available on Zenodo here: https://zenodo.org/record/3977808
Raw sequencing data and DNA accessibility processed files for the these datasets are available on NCBI GEO here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142745
View data download code
The required files can be downloaded by running the following lines in a shell:
```{bash eval=FALSE}
wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.filtered_peak_bc_matrix.h5 wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.singlecell.csv wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.fragments.tsv.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.fragments.tsv.gz.tbi
wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.A.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.C.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.G.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.T.txt.gz wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.depthTable.txt wget https://zenodo.org/record/3977808/files/CRC_v12-mtMask_mgatk.chrM_refAllele.txt
</details> # Colorectal cancer dataset To demonstrate combined analyses of mitochondrial DNA variants and accessible chromatin, we'll walk through a vignette analyzing cells from a primary colorectal adenocarcinoma. The sample contains a mixture of malignant epithelial cells and tumor infiltrating immune cells. ## Loading the DNA accessibility data First we load the scATAC-seq data and create a Seurat object following the standard workflow for scATAC-seq data. ```r # load counts and metadata from cellranger-atac counts <- Read10X_h5(filename = "../vignette_data/mito/CRC_v12-mtMask_mgatk.filtered_peak_bc_matrix.h5") metadata <- read.csv( file = "../vignette_data/mito/CRC_v12-mtMask_mgatk.singlecell.csv", header = TRUE, row.names = 1 ) # load gene annotations from Ensembl annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75) # change to UCSC style since the data was mapped to hg19 seqlevels(annotations) <- paste0('chr', seqlevels(annotations)) genome(annotations) <- "hg19" # create object crc_assay <- CreateChromatinAssay( counts = counts, sep = c(":", "-"), annotation = annotations, min.cells = 10, fragments = '../vignette_data/mito/CRC_v12-mtMask_mgatk.fragments.tsv.gz' ) crc <- CreateSeuratObject( counts = crc_assay, assay = 'peaks', meta.data = metadata ) crc[["peaks"]]
We can compute the standard quality control metrics for scATAC-seq and filter out low-quality cells based on these metrics.
# Augment QC metrics that were computed by cellranger-atac crc$pct_reads_in_peaks <- crc$peak_region_fragments / crc$passed_filters * 100 crc$pct_reads_in_DNase <- crc$DNase_sensitive_region_fragments / crc$passed_filters * 100 crc$blacklist_ratio <- crc$blacklist_region_fragments / crc$peak_region_fragments # compute TSS enrichment score and nucleosome banding pattern crc <- TSSEnrichment(crc) crc <- NucleosomeSignal(crc)
# visualize QC metrics for each cell VlnPlot(crc, c("TSS.enrichment", "nCount_peaks", "nucleosome_signal", "pct_reads_in_peaks", "pct_reads_in_DNase", "blacklist_ratio"), pt.size = 0, ncol = 3)
# remove low-quality cells crc <- subset( x = crc, subset = nCount_peaks > 1000 & nCount_peaks < 50000 & pct_reads_in_DNase > 40 & blacklist_ratio < 0.05 & TSS.enrichment > 3 & nucleosome_signal < 4 ) crc
Next we can load the mitochondrial DNA variant data for these cells that was
quantified using mgatk. The ReadMGATK()
function in Signac allows the output from mgatk
to be read directly into R in
a convenient format for downstream analysis with Signac. Here, we load the data
and add it to the Seurat object as a new assay.
# load mgatk output mito.data <- ReadMGATK(dir = "../vignette_data/mito/crc/") # create an assay mito <- CreateAssayObject(counts = mito.data$counts) # Subset to cell present in the scATAC-seq assat mito <- subset(mito, cells = Cells(crc)) # add assay and metadata to the seurat object crc[["mito"]] <- mito crc <- AddMetaData(crc, metadata = mito.data$depth[Cells(mito), ], col.name = "mtDNA_depth")
We can look at the mitochondrial sequencing depth for each cell, and further subset the cells based on mitochondrial sequencing depth.
VlnPlot(crc, "mtDNA_depth", pt.size = 0.1) + scale_y_log10()
# filter cells based on mitochondrial depth crc <- subset(crc, mtDNA_depth >= 10) crc
Next we can run a standard dimension reduction and clustering workflow using the scATAC-seq data to identify cell clusters.
crc <- RunTFIDF(crc) crc <- FindTopFeatures(crc, min.cutoff = 10) crc <- RunSVD(crc) crc <- RunUMAP(crc, reduction = "lsi", dims = 2:50) crc <- FindNeighbors(crc, reduction = "lsi", dims = 2:50) crc <- FindClusters(crc, resolution = 0.7, algorithm = 3)
DimPlot(crc, label = TRUE) + NoLegend()
To help interpret these clusters of cells, and assign a cell type label, we'll estimate gene activities by summing the DNA accessibility in the gene body and promoter region.
# compute gene accessibility gene.activities <- GeneActivity(crc) # add to the Seurat object as a new assay crc[['RNA']] <- CreateAssayObject(counts = gene.activities) crc <- NormalizeData( object = crc, assay = 'RNA', normalization.method = 'LogNormalize', scale.factor = median(crc$nCount_RNA) )
We note the following markers for different cell types in the CRC dataset:
DefaultAssay(crc) <- 'RNA' FeaturePlot( object = crc, features = c('TREM1', 'EPCAM', "PTPRC", "IL1RL1","GATA3", "KIT"), pt.size = 0.1, max.cutoff = 'q95', ncol = 2 )
Using these gene score values, we can assign cluster identities:
crc <- RenameIdents( object = crc, '0' = 'Epithelial', '1' = 'Epithelial', '2' = 'Basophil', '3' = 'Myeloid_1', '4' = 'Myeloid_2', '5' = 'Tcell' )
One of the myeloid clusters has a lower percentage of fragments in peaks, as well as a lower overall mitochondrial sequencing depth and a different nucleosome banding pattern.
p1 <- FeatureScatter(crc, "mtDNA_depth", "pct_reads_in_peaks") + ggtitle("") + scale_x_log10() p2 <- FeatureScatter(crc, "mtDNA_depth", "nucleosome_signal") + ggtitle("") + scale_x_log10() p1 + p2 + plot_layout(guides = 'collect')
We can see that most of the low FRIP cells were the myeloid 1
cluster. This is
most likely an intra-tumor granulocyte that has relatively poor accessible
chromatin enrichment. Similarly, the unusual nuclear chromatin packaging of this
cell type yields slightly reduced mtDNA coverage compared to the myeloid 2
cluster.
Next, we can identify sites in the mitochondrial genome that vary across cells, and cluster the cells into clonotypes based on the frequency of these variants in the cells. Signac utilizes the principles established in the original mtscATAC-seq work of identifying high-quality variants.
variable.sites <- IdentifyVariants(crc, assay = "mito", refallele = mito.data$refallele) VariantPlot(variants = variable.sites)
The plot above clearly shows a group of variants with a higher VMR and strand concordance. In principle, a high strand concordance reduces the likelihood of the allele frequency being driven by sequencing error (which predominately occurs on one but not the other strand. This is due to the preceding nucleotide content and a common error in mtDNA genotyping). On the other hand, variants that have a high VMR are more likely to be clonal variants as the alternate alleles tend to aggregate in certain cells rather than be equivalently dispersed about all cells, which would be indicative of some other artifact.
We note that variants that have a very low VMR and and very high strand concordance are homoplasmic variants for this sample. While these may be interesting in some settings (e.g. donor demultiplexing), for inferring subclones, these are not particularly useful.
Based on these thresholds, we can filter out a set of informative mitochondrial variants that differ across the cells.
# Establish a filtered data frame of variants based on this processing high.conf <- subset( variable.sites, subset = n_cells_conf_detected >= 5 & strand_correlation >= 0.65 & vmr > 0.01 ) high.conf[,c(1,2,5)]
A few things stand out. First, 10 out of the 12 variants occur at less than 1% allele frequency in the population. However, 16147C>T is present at about 62%. We'll see that this is a clonal variant marking the epithelial cells. Additionally, all of the called variants are transitions (A - G or C - T) rather than transversion mutations (A - T or C - G). This fits what we know about how these mutations arise in the mitochondrial genome.
Depending on your analytical question, these thresholds can be adjusted to identify variants that are more prevalent in other cells.
We currently have information for each strand stored in the mito assay to allow
strand concordance to be assessed. Now that we have our set of high-confidence
informative variants, we can create a new assay containing strand-collapsed
allele frequency counts for each cell for these variants using the AlleleFreq()
function.
crc <- AlleleFreq( object = crc, variants = high.conf$variant, assay = "mito" ) crc[["alleles"]]
Now that the allele frequencies are stored as an additional assay, we can use
the standard functions in Seurat to visualize how these allele frequencies are
distributed across the cells. Here we visualize a subset of the variants using
FeaturePlot()
and DoHeatmap()
.
DefaultAssay(crc) <- "alleles" alleles.view <- c("12889G>A", "16147C>T", "9728C>T", "9804G>A") FeaturePlot( object = crc, features = alleles.view, order = TRUE, cols = c("grey", "darkred"), ncol = 4 ) & NoLegend()
DoHeatmap(crc, features = rownames(crc), slot = "data", disp.max = 1) + scale_fill_viridis_c()
Here, we can see a few interesting patterns for the selected variants. 16147C>T is present in essentially all epithelial cells and almost exclusively in epithelial cells (the edge cases where this isn't true are also cases where the UMAP and clustering don't full agree). It is at 100% allele frequency-- strongly suggestive of whatever cell of origin of this tumor had the mutation at 100% and then expanded. We then see at least 3 variants 1227G>A, 12889G>A, and 9728C>T that are mostly present specifically in the epithelial cells that define subclones. Other variants including 3244G>A, 9804G>A, and 824T>C are found specifically in immune cell populations, suggesting that these arose from a common hematopoetic progenitor cell (probably in the bone marrow).
Next we'll demonstrate a similar workflow to identify cell clones in a different dataset, this time generated from a TF1 cell line. This dataset contains more clones present at a higher proportion, based on the experimental design.
We'll demonstrate how to identify groups of related cells (clones) by clustering the allele frequency data and how to relate these clonal groups to accessibility differences utilizing the multimodal capabilities of Signac.
View data download code
To download the data from Zenodo run the following in a shell:
```{bash eval=FALSE}
wget https://zenodo.org/record/3977808/files/TF1.filtered.fragments.tsv.gz wget https://zenodo.org/record/3977808/files/TF1.filtered.fragments.tsv.gz.tbi wget https://zenodo.org/record/3977808/files/TF1.filtered.narrowPeak.gz
wget https://zenodo.org/record/3977808/files/TF1_filtered.A.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.T.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.C.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.G.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.chrM_refAllele.txt.gz wget https://zenodo.org/record/3977808/files/TF1_filtered.depthTable.txt.gz
</details> ```r # read the mitochondrial data tf1.data <- ReadMGATK(dir = "../vignette_data/mito/tf1/") # create a Seurat object tf1 <- CreateSeuratObject( counts = tf1.data$counts, meta.data = tf1.data$depth, assay = "mito" ) # load the peak set peaks <- read.table( file = "../vignette_data/mito/TF1.filtered.narrowPeak.gz", sep = "\t", col.names = c("chrom", "start", "end", "peak", "width", "strand", "x", "y", "z", "w") ) peaks <- makeGRangesFromDataFrame(peaks) # create fragment object frags <- CreateFragmentObject( path = "../vignette_data/mito/TF1.filtered.fragments.tsv.gz", cells = colnames(tf1) ) # quantify the DNA accessibility data counts <- FeatureMatrix( fragments = frags, features = peaks, cells = colnames(tf1) ) # create assay with accessibility data and add it to the Seurat object tf1[["peaks"]] <- CreateChromatinAssay( counts = counts, fragments = frags )
# add annotations Annotation(tf1[["peaks"]]) <- annotations
DefaultAssay(tf1) <- "peaks" tf1 <- NucleosomeSignal(tf1) tf1 <- TSSEnrichment(tf1)
VlnPlot(tf1, c("nCount_peaks", "nucleosome_signal", "TSS.enrichment"), pt.size = 0.1)
tf1 <- subset( x = tf1, subset = nCount_peaks > 500 & nucleosome_signal < 2 & TSS.enrichment > 2.5 ) tf1
DefaultAssay(tf1) <- "mito" variants <- IdentifyVariants(tf1, refallele = tf1.data$refallele) VariantPlot(variants)
high.conf <- subset( variants, subset = n_cells_conf_detected >= 5 & strand_correlation >= 0.65 & vmr > 0.01 )
tf1 <- AlleleFreq(tf1, variants = high.conf$variant, assay = "mito") tf1[["alleles"]]
Now that we've identified a set of variable alleles, we can cluster the cells
based on the frequency of each of these alleles using the FindClonotypes()
function. This uses the Louvain community detection algorithm implemented in
Seurat.
DefaultAssay(tf1) <- "alleles" tf1 <- FindClonotypes(tf1)
table(Idents(tf1))
Here we see that the clonal clustering has identified 12 different clones in the
TF1 dataset. We can further visualize the frequency of alleles in these clones
using DoHeatmap()
. The FindClonotypes()
function also performs hierarchical
clustering on both the clonotypes and the alleles, and sets the factor levels
for the clonotypes based on the hierarchical clustering order, and the order of
variable features based on the hierarchical feature clustering. This allows us
to get a decent ordering of both features and clones automatically:
DoHeatmap(tf1, features = VariableFeatures(tf1), slot = "data", disp.max = 0.1) + scale_fill_viridis_c()
Next we can use the clonal information derived from the mitochondrial assay to find peaks that are differentially accessible between clones.
DefaultAssay(tf1) <- "peaks" # find peaks specific to one clone markers.fast <- FoldChange(tf1, ident.1 = 2) markers.fast <- markers.fast[order(markers.fast$avg_log2FC, decreasing = TRUE), ] # sort by fold change head(markers.fast)
We can the DNA accessibility in these regions for each clone using
the CoveragePlot()
function. As you can see, the peaks identified are highly
specific to one clone.
CoveragePlot( object = tf1, region = rownames(markers.fast)[1], extend.upstream = 2000, extend.downstream = 2000 )
Session Info
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.