README.md

mtAnalysis

Comprehensive annotation to mtDNA mutations.

Authors: Xianbang Sun (maintainer, sxb3000@bu.edu), Katia Bulekova, Xue Liu, Meng Lai, Chunyu Liu Date: “08/07/2022”

mtAnalysis is an R package for identifying and annotating mtDNA sequence variations. This package allows to identify mtDNA sequence variations with user-specified thresholds, and provide several plots to visualize mtDNA sequence variations. It also annotates all mtDNA variations and predicts functionality for mtDNA variants in the coding and tRNA regions.

Note: if you use mtAnalysis in published research, please cite:

Xianbang Sun, Katia Bulekova, Kaiyu Yan, Daniel Levy, Jun Ding, Chunyu Liu, Jessica L. Fetterman (2020) mtAnalysis: an R package for comprehensive annotation of mtDNA sequence variation and association analysis

Installation

You can install the development version of mtAnalysis package from GitHub with:

# install.packages("devtools")
# devtools::install_github("mtDNA-BU/mtAnalysis")
library(mtAnalysis)

In addition, our package requires the seqMeta package. The instructions for the installation of the seqMeta package can be found at https://github.com/DavisBrian/seqMeta

Standard workflow

Import the original allele, frequency and coverage datasets

input_path = "/path/to/input/directory/"
allele <- read.csv(paste0(input_path, "allele/allele_ANNOmtDNA.csv"), header = T, stringsAsFactors=FALSE,
                   colClasses = c("character"))
freq <- read.csv(paste0(input_path, "freq/freq_ANNOmtDNA.csv"), header=T, stringsAsFactors=FALSE,
                      colClasses = c("character"))
coverage <- read.csv(paste0(input_path, "coverage/coverage_ANNOmtDNA.csv"), header=T)
allele <- as.matrix(allele)
freq <- as.matrix(freq)
coverage <- data.matrix(coverage)

allele: a character matrix (16569 x N) provided by the user. Rows correspond to loci and columns correspond to subjects. This matrix contains the alleles of each subject at each locus. The matrix must contain subject ID as the column names. “/” is used to delimited different allele calls in a locus.

freq: a character matrix (16569 x N) provided by the user. Rows correspond to loci and columns correspond to subjects. This matrix contains the allele fractions of the corresponding allele matrix. The matrix must contain subject ID as the column names. “/” is used to delimited the allele fractions.

coverage: a numeric matrix (16569 x N). Rows correspond to loci and columns correspond to subjects. This matrix contains the reads coverage of the 16569 mtDNA loci for each subject. The matrix must contain the subject ID as the column names.

Compute alternative allele fraction (AAF) by mtAAF function

method is the argument to choose method to compute AAF for the case of multiple alternative alleles. The default method “maxAA” computes AAF as the maximum of frequencies of corresponding alternative alleles; and the alternative method “allAA” computes AAF as 1 minus the frequency of reference allele

AAF <- mtAAF(allele, freq, method = "maxAA")

Scatter plot of output of mtAAF function, each point is AAF of a subject at a locus, the x axis is the mtDNA loci and the y axis is the range of AAF (0-100%)

plot(AAF)

Plot the mean coverage of each locus by plotCover function, the x axis is the mtDNA loci and y axis is the mean coverage

plotCover(coverage, type="mean")

Histogram of the median coverage of each subject by histSampCov function

histSampCov(coverage, ylim=c(0,1200))

Summarize and annotate mtDNA mutations by mtSummary function

Specify the path to output the annotation file and histograms of mtDNA mutation burden of subjects and number of mutations carried by each loci for heteroplasmic and homoplasmic mutations

Users can specify lower bound and upper bound of the threshold by thre.lower and thre.upper arguments. By default, thre.lower=0.03 and thre.upper=0.97. That is, if 0.03\leq AAF\leq 0.97, it is a heteroplasmic mutation; and if AAF>0.97, it is a homoplasmic mutation. Users can also choose different gene regions by loci argument. For example, to choose tRNA region, set loci=“tRNA”. Users can also choose types of mutations to be annotated by type argument. For example, to choose heteroplasmic mutations, set type=“heter”, and choose homoplasmic mutations, set type=“homo”. Users can also specify the types of comprehensive predicted functional scores and categories by annot.select argument. The default is c(“Pos”, “ref”, “Gene”, “TypeMutation”,“MissensMutation”, “CodonPosition”, “ProteinDomain”, “dbSNP_150_id”, “PolyPhen2”, “PolyPhen2_score”, “SIFT”, “SIFT_score”, “CADD”, “CADD_score”, “CADD_phred_score”)

The types of comprehensive predicted functional scores and categories to choose are: TypeMutation, MissensMutation, CodonPosition, ProteinDomain, mFOLD_dG, mFOLD_Initial, mFOLD_rCRS DG, mFOLD_rCRS Initial, mFOLD_AnticodonAminoAcidChange, mFOLD_Location, PolyPhen2, PolyPhen2_score, SIFT, SIFT_score, PROVEAN, PROVEAN_score, MutationAssessor, MutationAssessor_score, CADD, CADD_score, CADD_phred_score, PANTHER, PANTHER_score, PhD_SNP, PhD_SNP_score, SNAP, SNAP_score, MutationTaster, MutationTaster_score, dbSNP_150_id

Run the mtSummary function, only include loci of coding region, annotate for both of heteroplasmic and homoplasmic mutations.

output_path <- "/output/dir/"
mtSum <- mtSummary(aaf=AAF, allele=allele, freq=freq, coverage=coverage,
         loci="coding", path=output_path, type="both", study="ARIC")

Part of the output of annotated alleles

#>   mtID ref_allele allele_all n_var n_heter n_homo mut_allele  Pos ref Gene
#> 3 3310          C        C/T     1       0      1          T 3310   C  ND1
#>    TypeMutation MissensMutation CodonPosition          ProteinDomain
#> 3 Nonsynonymous             P2S             1 Transmembrane; Helical
#>   dbSNP_150_id PolyPhen2 PolyPhen2_score    SIFT SIFT_score        CADD
#> 3         <NA>    benign            0.11 neutral       0.34 deleterious
#>   CADD_score CADD_phred_score
#> 3       2.45            19.15

Output of histograms of summarized mutations

Summary of the mean coverage of loci

mtSum$coverLoci 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    4136    5315    5534    5520    5719    6231

Summary of the mean coverage of subjects

mtSum$coverSubjects  
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   935.7  4338.6  5325.6  5520.2  6469.6 16544.7

Display loci of variation

mtSum$loci_var  

Summary of the heteroplasmic burden of subjects

mtSum$heter_burden_sum  
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.0000  0.0000  0.3901  1.0000 68.0000

Summary of numbers of heteroplasmic mutations loci carried

mtSum$heter_loci_sum  
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.0000  0.0000  0.1385  0.0000  8.0000

Display loci of heteroplasmy

mtSum$loci_heter  

Total number of heteroplasmic mutations

mtSum$heter_total  
#> [1] 1571

Summary of the homoplasmic burden of subjects

mtSum$homo_burden_sum  
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00    6.00   14.00   13.32   18.00   61.00

Summary of numbers of homoplasmic mutations loci carried

mtSum$homo_loci_sum  
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>    0.000    0.000    0.000    4.729    0.000 3972.000

Display loci of homoplasmy

mtSum$loci_homo  

Total number of homoplasmic mutations

mtSum$homo_total  
#> [1] 53632

Annotate alternative alleles by mtAnno function

Generate alternative alleles to be annotated for mtDNA loci. It has two columns: loci positions (“pos”) and “alleles” to be annotated.

anno <- as.data.frame(matrix(0, 10, 2))
colnames(anno)<- c("pos", "alleles")
anno$pos <- c(3311:3320)
anno$alleles <- c("A", "A", "T", "C", "A", "T", "G", "A", "C", "T")

Run the mtAnno function

mtAnno(anno=anno, path=output_path)

Part of the output of annotated alleles

#>    pos alleles  Pos ref Gene  TypeMutation MissensMutation CodonPosition
#> 1 3311       A 3311   C  ND1 Nonsynonymous             P2H             2
#>            ProteinDomain dbSNP_150_id PolyPhen2 PolyPhen2_score    SIFT
#> 1 Transmembrane; Helical           NA    benign            0.01 neutral
#>   SIFT_score        CADD CADD_score CADD_phred_score
#> 1       0.34 deleterious       2.45            19.12


mtDNA-BU/ANNOmtDNA documentation built on Aug. 11, 2022, 1:46 p.m.