Given single-cell RNA-seq data from several individuals, this vignette describes how to identify somatic mitochondrial variants that are specific to single individuals, while at the same time creating a blacklist of sites that are observed in several individuals. The rational for this strategy is that artefactual (RNA-only) variants are often shared between individuals, whereas true somatic variants are not.

Description of input data

The first step for prepocessing will be to count variants (A, C, G, T, insertion, deletion) at every site of the mitochondrial genome. For this, a list of BAM files is required; to access the BAM files and raw data used for this study, we have to refer to EGA (Study EGAS00001003414). Assuming bam files are in a folder bam, count tables can be created with

countTables <- baseCountsFromBamList( list.files("bam", full.names = T) )

Here, we download pre-computed base count tables. We note that these consist of lists of matrices. Each entry in the list corresponds to a single cell. Matrix rows are sites in the mitochondrial genome, and columns are counts

countTables <- readRDS(url("http://steinmetzlab.embl.de/mutaseq/nuc.count.per.position.RDS"))
countTables <- countTables[grep('Q59',names(countTables),invert=T)]
print(head(countTables[[1]]))

Finally, we parse out the patient that each cell in the list derives from. We also count the number of cells per patient.

patient <- gsub("_.+","",names(countTables))
print(table(patient))

Cohort-level variant calling

We next use the mutationCallsFromCohort function to a) filter on coverage to exclude potentially noisy variants and b) compare allele frequencies between patients to remove variants that were observed in several individuals and that therefore are unlikely to represent true somatic variants (e.g. RNA editing events).

library(mitoClone)
result <- mutationCallsFromCohort(countTables, patient)
print(colnames(result$P342@M))
print(colnames(result$HRK@M))

The result for patient P1 (result$P342) an P2 (result$HRK) serves as input into the vignette Computation of clonal hierarchies and clustering of mutations. The blacklist (result$blacklist) serves as input into the vignette Variant calling and validation. We have added regions based on a conservative set of false-positive variants from our datsets and further trinucleotide repeats and softmased regions to this blacklist, since we observed that those tend to contain false positives as well. See the blacklist object provided as part of the package.

Notes on parameter choice

Some notes about possible parameters in the mutationCallsFromCohort: By default, we select coordinates in the mitochondrial genome containing at least MINREADS=5 reads each in at least MINCELL=20 cells. To distinguish RNA editing events and true mitochondrial mutations, we then identify mitochondrial variants that occur in several individuals. For this purpose only, individual cells are therefore called as 'mutant' in a given genomic site if at least MINFRAC=10% of the reads from that cell were from a minor allele (i.e. distinct from the reference). Mutations present in at least MINRELATIVE.PATIENT=1% of cells in a given patient, but no more than MINCELLS.PATIENT=10 cells in any other individual, are then included into the final dataset and counts supporting the reference and mutant alleles are computed as for sites of interest in the nuclear genome.

Depending on the choice of parameters here, sites with low frequencies of RNA editing or low levels of heteroplasmy in all cells can be included and make the dataset more noisy, in particular if the MINFRAC parameter is set too low. On the other hand, a too stringent filtering can cause a loss of potentially informative sites. Hence, we recommend to rather begin with parameters that yield a larger list of sites and then perform the following checks to, manually or through the use of altered filtering parameters, remove erroneous sites:

#retrieve the binary mutation status for nuclear and mitochondrial genomic mutation from read count tables (corresponds to result$P342 , but additionally contains nuclear mutation calls)
mutations <- mutationCallsFromMatrix(as.matrix(M_P1), as.matrix(N_P1))


getasso <- function(m1,m2) {
      testm <- matrix( data = c(
        sum(m1 =="1" &m2 == "1"),
        sum(m1 =="1" &m2 == "0"),
        sum(m1 == "0" & m2 == "1"),
        sum(m1 == "0" & m2 == "0")
      ), ncol = 2  )
      fisher.test(testm)
}


#run association tests between binary mitochondrial mutation status and binary SRSF2 mutation status
mitochondrial <- colnames(mutations@ternary)[grep("^X\\d+",colnames(mutations@ternary))]
tests <- apply(mutations@ternary[,mitochondrial], 2, getasso, m2 = mutations@ternary[,"SRSF2"])

p.val <- sapply(tests, "[[", "p.value")
est <- sapply(tests, "[[", "estimate")
plf <- data.frame(name = mitochondrial,
                  pval = -log10(p.val),
                  or = est)

library(ggplot2)
qplot(x = name , y = pval, color = paste0(or <1, pval > 1), data=na.omit(plf)) + coord_flip() + theme_bw()+ theme(panel.grid = element_blank(), axis.title.x = element_text(color="black"), axis.text = element_text(color="black"), axis.text.y = element_text(size = 6)) + xlab("") + ylab("Association w/ SRSF2 mutation (-log10 p)")+ scale_color_manual(values = c("TRUETRUE" = "blue","FALSETRUE" = "red", "FALSEFALSE" = "grey", "TRUEFALSE" = "grey"), guide=F)


veltenlab/mitoClone documentation built on April 18, 2021, 5:19 a.m.