knitr::opts_chunk$set(echo = TRUE,
  tidy.opts=list(width.cutoff=70),
  tidy=TRUE)

Overview

In this vignette, we demonstrate the ability of mitoClone2 to identify mitochondrial variants from single-cell sequencing data of a single or multiple individuals without any a priori knowledge of pre-existing variants by capitalizing on either an extensive exclusionlist (that removes known/potential false-positive signals) OR information shared across multiple individuals. We note here that of these methods, the second, is the most powerful. The latter strategy relies on the assumption that biological noise (e.g. RNA editing) and technical noise (arising during either library preparation, RNA sequencing, read processing, or alignment steps of the experiment) should be consistent across samples from different individuals. On this end, each individual sample can then act as essentially a technical/pseudo-replicate for identification and elimination the mentioned artifacts. In situations where a single sample from one individual is analyzed, we fall back on our standard filtering method using pre-constructed exclusionlists including information related to homo-polymers, repeat regions, RNA editing sites, and known sequencing artifacts.

library(mitoClone2)

Introduction

A major untapped resource in identifying clonal populations in single-cells can be found in the mitochondrial reads. Often discarded or used a marker for bad cells, the high mutation rate and coverage of the mitochondria makes it a prime piece of real estate for detecting genetic differences among cells. Another important note of biological importance is that in many cases multiple mitochondria exist within individual cells, each in turn containing multiple copies of the mitochondrial genome. Thus the heterogeneity of mitochondria does not follow the standard Mendelian bifurcation into homo/heterozygous mutations. The term used to describe this phenomenon is heteroplasmy - which is defined as the presence of more than one organellar genome within a single cell.

Package goals

The primary goal of this package is to extract and analyze mitochondrial variants from single-cell RNA-seq data. The method should also work equally well with single-cell ATAC/DNA-seq data. Although the majority of functions provided by the package should work with BAM files from bulk sequencing experiments, a portion of the core functionality will be lost.

Genome availability

There is built in compatibility with the hg38/hg19 human genomes and mm10 mouse genome. IMPORTANT: The mitochondrial genome sequence used in the UCSC hg19 reference is different than that used for NCBI's GRCh37. GRCh37 uses the same reference as hg38/GRCh38 so make sure to check your relevant FASTA files to verify the expected length of your mitochondria. The lengths are as follows: UCSC hg38 (16569 bp), UCSC hg19 (16571 bp), and UCSC mm10 (16299 bp).

Input data

Before we can begin addressing the different aspects of filtering and identifying bonafide variants, it is important to import your data in the appropriate format. In our case, we rely most heavily on the pre-existing structure defined by the deepSNV::bam2R() function which produces 22 columns matrix with a number of rows corresponding to ranges provided as input parameters. IMPORTANT: For the majority of this vignette, the assumption is that we are working with the entire mitochondrial genome of a given organism. You are free to input any position by mutation matrix to the filtering and clustering functions, but keep in mind certain parameters should be adjusted.

Counting nucleotides from BAM files

To generate your count tables we have provided two functions to do so: Firstly, these tables can be created from a list of bam files using the function baseCountsFromBamList this function takes a list of bam files which can be created with list.files(). This function assumes that each individual single cell is represented by a single bam file. Alternatively, with our recent updates you can now pull variants directly from a 10x Genomics multiplexed BAM file using bam2R_10x. This relies on using the CB bam tag present within the BAM file to distinguish individual cells. Two helpful suggestions for this process are to initially subset BAM files for reads located on the mitochondrial chromosome and also to deduplicate reads if UMIs are present. You may also choose to count up the different nucleotides using a different method. In such cases, the only requirements for downstream mutation calling functions is that your baseCounts object be an R list that contains 5 to 8 columns where rows represent genomic positions and columns represent base calls (i.e. A,G,C,T,-,N,INS,DEL)

Counting function examples and explanations

The basic function provided is baseCountsFromBamList which reads in BAM files from a list and has four parameters:

Note: Be aware that this function calls deepSNV::bam2R() with default parameters.

## example using function for a list of BAM files   

baseCounts <- baseCountsFromBamList(bamfiles =
list(system.file("extdata", "mm10_10x.bam", package="mitoClone2")),
sites="chrM:1-15000",
ncores=1)
print(length(baseCounts))   

print(head(baseCounts[[1]][13267:13277,]))   

The next function is for reading in a multiplexed BAM file bam2R_10x takes multiple parameters:

## example using function for a multiplexed BAM files
## cell barcodes found by extracting the 'CB' BAM tag
baseCounts <- bam2R_10x(file = system.file("extdata", "mm10_10x.bam",
package="mitoClone2"), sites="chrM:1-15000", ncores=1)
print(length(baseCounts))   

print(head(baseCounts[[1]][13267:13277,]))   

Calling and filtering variants

The next step of the analysis is to take your counts of each nucleotide in each cell and identify variants that are present. A major barrier in this process is the inherent noise present in single cell datasets. Although we will only briefly discuss the important aspects of filtering this is by far the most crucial step of the analysis. The prevalence of mitochondrial variants is highly variable, in some cases more or less depending on the dataset. It is wise to adjust the parameters in order to tailor them to each particular dataset - one size fits all does not apply here. It is critical to choose optimal parameters for a dataset and experimentation is often necessary.

Calling mutations based on an exclusionlist

The simplest way to identify variants, beyond simply enriching for those that are present in only a fraction of the population, is by removing detected variants that are likely to be caused by noise (as previously discussed). One conservative approach to accomplish this is to simply remove all variants that overlap with problematic sites. As a side note, this type of approach will decrease false positive hits but may incidentally increase false negative hits. This type of method is used when a you only have a single sample from one individual/timepoint.

The function mutationCallsFromExclusionlist performs filtering and then removes any variants on the exclusionlists. Parameters:

Calling mutations based on variants shared within a cohort

The next way of filtering is a bit more intuitive. First of all, we make the following important assumption: We are comparing similar sets of cells between patients/timepoints and the only differences between them are biological (thus all technical steps are identical). Under this assumption, any technical noise as described previously '(e.g. sequencing artifacts, alignment errors) and even low levels of biological noise (e.g. RNA-editing) should be mostly preserved between samples. Thus, by identifying all shared features (i.e. variants) between two samples being compared, we can exclude all this noise and identify true somatic variants. TL;DR: We ask the question - which variants are only present in a single individual and not in others?

The function mutationCallsFromCohort performs filtering based on mutations shared across experiments/samples and then retrieves variants that are uniquely enriched in single patients. Parameters:

To exemplify the use of these functions we are going to build a figure based on data from a previously published data set. In this case, one of the first papers to recognize the use of mitochondrial variants in single-cell sequencing. Ludwig et al., 2019 Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics. Cell. 2019 Mar 7;176(6):1325-1339.e22. doi: 10.1016/j.cell.2019.01.022 https://pubmed.ncbi.nlm.nih.gov/30827679/. The dataset has been shrunk substantially for this vignette and the names of the mitochondrial variants are randomized.

In this example, as a proof-of-concept, the experimenters used donor hematopoietic stem and progenitor cells (HSPCs) to derive cell colonies and then isolated cells from each for single-cell sequencing. As expected, the cells that were isolated from the same colony tended exhibit shared mitochondrial variants (i.e. they are indeed (sub)-clonal). The variant sites have been randomized and shrunk in order to save space.

Example of Exclusionlist filtering

library(mitoClone2)
library(pheatmap)

## load our count data
load(system.file("extdata/example_counts.Rda",package = "mitoClone2"))

## investigate what our ideal baseCounts object looks like
print(head(example.counts[[1]]))

## calling mutations using our exclusionlist
Example <- mutationCallsFromExclusionlist(example.counts, min.af=0.05,
min.num.samples=5, universal.var.cells = 0.5 * length(example.counts),
binarize = 0.1)

## setting up the meta data
Example.rawnames <- getAlleleCount(Example,'nonmutant')
Example.meta <- data.frame(row.names = rownames(Example.rawnames),
Clone = gsub("_.*","",gsub("Donor1_","",rownames(Example.rawnames))))

## showing the clustering via heatmap (notice only 20 mutations were kept)
clustered <- quick_cluster(Example, binarize = TRUE, drop_empty = TRUE,
clustering.method = "ward.D2", annotation_col = Example.meta,
show_colnames=FALSE,fontsize_row = 7)

Example of Cohort filtering

Now (knowing the original parental colony from the publication) we apply cohort filtering to revealing similar results.

## Calling mutations using our known clonal information
Example <- mutationCallsFromCohort(example.counts, MINFRAC=0.01, MINCELL=8,
MINCELLS.PATIENT=5, patient = Example.meta$Clone, sites='chrM:1-55')
## check the variants for a specific cluster
print(colnames(getAlleleCount(Example$C112,'mutant')))
private.mutations <- unique(unlist(
lapply(Example[grep('^C',names(Example))][unlist(lapply(
Example[grep('^C',names(Example))],function(x)!is.null(x)))],
function(y) names(getVarsCandidate(y)))
))
private.mutations.df <- pullcountsVars(example.counts,
gsub("X(\\d+)([AGCT])\\.([AGCT])","\\1 \\2>\\3",private.mutations))
private.mutations.df <- mutationCallsFromMatrix(t(private.mutations.df$M),
t(private.mutations.df$N), cluster = rep(TRUE, length(private.mutations)))

## Showing the clustering looks via heatmap
clustered <- quick_cluster(private.mutations.df, binarize = TRUE,
drop_empty = TRUE, clustering.method = "ward.D2",
annotation_col = Example.meta,show_colnames=FALSE,fontsize_row = 7)

Session information

sessionInfo()


benstory/mitoClone2 documentation built on Sept. 6, 2024, 6:45 a.m.