Given single-cell RNA-seq data from several individuals, this vignette describes how to identify somatic mitochondrial variants that are specific to single individuals, while at the same time creating a blacklist of sites that are observed in several individuals. The rational for this strategy is that artefactual (RNA-only) variants are often shared between individuals, whereas true somatic variants are not.
The first step for prepocessing will be to count variants (A, C, G, T, insertion, deletion) at every site of the mitochondrial genome. For this, a list of BAM files is required; to access the BAM files and raw data used for this study, we have to refer to EGA (Study EGAS00001003414). Assuming bam files are in a folder bam
, count tables can be created with
countTables <- baseCountsFromBamList( list.files("bam", full.names = T) )
Here, we download pre-computed base count tables. We note that these consist of lists of matrices. Each entry in the list corresponds to a single cell. Matrix rows are sites in the mitochondrial genome, and columns are counts
countTables <- readRDS(url("http://steinmetzlab.embl.de/mutaseq/nuc.count.per.position.RDS")) countTables <- countTables[grep('Q59',names(countTables),invert=T)] print(head(countTables[[1]]))
Finally, we parse out the patient that each cell in the list derives from. We also count the number of cells per patient.
patient <- gsub("_.+","",names(countTables)) print(table(patient))
We next use the mutationCallsFromCohort
function to a) filter on coverage to exclude potentially noisy variants and b) compare allele frequencies between patients to remove variants that were observed in several individuals and that therefore are unlikely to represent true somatic variants (e.g. RNA editing events).
library(mitoClone) result <- mutationCallsFromCohort(countTables, patient) print(colnames(result$P342@M)) print(colnames(result$HRK@M))
The result for patient P1 (result$P342
) an P2 (result$HRK
) serves as input into the vignette Computation of clonal hierarchies and clustering of mutations. The blacklist (result$blacklist
) serves as input into the vignette Variant calling and validation. We have added regions based on a conservative set of false-positive variants from our datsets and further trinucleotide repeats and softmased regions to this blacklist, since we observed that those tend to contain false positives as well. See the blacklist
object provided as part of the package.
Some notes about possible parameters in the mutationCallsFromCohort
: By default, we select coordinates in the mitochondrial genome containing at least MINREADS=5
reads each in at least MINCELL=20
cells. To distinguish RNA editing events and true mitochondrial mutations, we then identify mitochondrial variants that occur in several individuals. For this purpose only, individual cells are therefore called as 'mutant' in a given genomic site if at least MINFRAC=10%
of the reads from that cell were from a minor allele (i.e. distinct from the reference). Mutations present in at least MINRELATIVE.PATIENT=1%
of cells in a given patient, but no more than MINCELLS.PATIENT=10
cells in any other individual, are then included into the final dataset and counts supporting the reference and mutant alleles are computed as for sites of interest in the nuclear genome.
Depending on the choice of parameters here, sites with low frequencies of RNA editing or low levels of heteroplasmy in all cells can be included and make the dataset more noisy, in particular if the MINFRAC
parameter is set too low. On the other hand, a too stringent filtering can cause a loss of potentially informative sites. Hence, we recommend to rather begin with parameters that yield a larger list of sites and then perform the following checks to, manually or through the use of altered filtering parameters, remove erroneous sites:
Use the quick_cluster
and the muta_cluster
functions to check if any potential mutations are randomly distributed and do not follow a hierarchy.
If data on genomic mutations is available, perform association tests and prioritize mitochondrial variants that are significantly associated with nuclear variants, for example for patient P1 (P342):
#retrieve the binary mutation status for nuclear and mitochondrial genomic mutation from read count tables (corresponds to result$P342 , but additionally contains nuclear mutation calls) mutations <- mutationCallsFromMatrix(as.matrix(M_P1), as.matrix(N_P1)) getasso <- function(m1,m2) { testm <- matrix( data = c( sum(m1 =="1" &m2 == "1"), sum(m1 =="1" &m2 == "0"), sum(m1 == "0" & m2 == "1"), sum(m1 == "0" & m2 == "0") ), ncol = 2 ) fisher.test(testm) } #run association tests between binary mitochondrial mutation status and binary SRSF2 mutation status mitochondrial <- colnames(mutations@ternary)[grep("^X\\d+",colnames(mutations@ternary))] tests <- apply(mutations@ternary[,mitochondrial], 2, getasso, m2 = mutations@ternary[,"SRSF2"]) p.val <- sapply(tests, "[[", "p.value") est <- sapply(tests, "[[", "estimate") plf <- data.frame(name = mitochondrial, pval = -log10(p.val), or = est) library(ggplot2) qplot(x = name , y = pval, color = paste0(or <1, pval > 1), data=na.omit(plf)) + coord_flip() + theme_bw()+ theme(panel.grid = element_blank(), axis.title.x = element_text(color="black"), axis.text = element_text(color="black"), axis.text.y = element_text(size = 6)) + xlab("") + ylab("Association w/ SRSF2 mutation (-log10 p)")+ scale_color_manual(values = c("TRUETRUE" = "blue","FALSETRUE" = "red", "FALSEFALSE" = "grey", "TRUEFALSE" = "grey"), guide=F)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.