AlphaBeta

options(width=120)
knitr::opts_chunk$set(
  collapse = FALSE,
  eval = TRUE,
  comment = " ",
  tidy.opts =list(width.cutoff=80),
  tidy = TRUE,
  size="small"
)

\newpage

library(AlphaBeta)
library(data.table)
library(igraph)
library(ggplot2)

\section{Introduction} AlphaBeta is a computational method for estimating epimutation rates and spectra from high-throughput DNA methylation data in plants. The method can be generally applied to study 'germline' epimuations in mutation accumulation lines (MA-lines), as well as 'somatic' epimutations in long-lived perennials, such as trees. Details regarding the inference approach and example applications can be found in Shahryary et al. 2020.

Experimental systems

A key challenge in studying epimutational proceses in multi-generational experiments is to be able to distinguish ’germline’ epimutations from other types of methylation changes, such as those that are associated with segregating genetic variation or transient environmental perturbations. Mutation accumulation lines (MA-lines) grown in controlled laboratory conditions are a powerful experimental system to achieve this. MA-lines are derived from a single isogenic founder and are independently propagated for a large number of generations. The lines can be advanced either clonally or sexually, i.e. self-fertilization or sibling mating (Fig. 1A). In clonally produced MA lines the isogenicity of the founder is not required because the genome is ’fixed’ due to the lack of genetic segregation.

The kinship among the different MA lineages can be presented as a pedigree (Fig. 1A). The structure (or topology) of these pedigrees is typically known, a priori, as the branch-point times and the branch lengths are deliberately chosen as part of the experimental design. In conjunction with multi-generational methylome measurements MA lines therefore permit ‘real-time’ observations of ’germline’ epimutations against a nearly invariant genomic background, and can facilitate estimates of the per generation epimutation rates.

Beyond experimentally-derived MA lines, natural mutation accumulation systems can also be found in the context of plant development and aging. An instructive example are long-lived perennials, such as trees, whose branching structure can be interpreted as a pedigree (or phylogeny) of somatic lineages that carry information about the epimutational history of each branch. In this case, the branch-point times and the branch lengths can be determined ad hoc using coring data, or other types of dating methods (Fig. 1B). By combining this information with contemporary leaf methylome measurements it is possible to infer the rate of somatic epimutations as a function of age (see Hofmeister et al.).

#knitr::include_graphics(paste0(output.dir=getwd(),"/Figure1.png"))
f1 <- system.file("extdata/vg", "Figure1.png",  package = "AlphaBeta")
knitr::include_graphics(f1)

DNA methylation sampling strategies

In the analysis of selfing or clonally-derived MA-lines, DNA methylation samples are typically obtained from the final generation (Fig. 2, endpoint sampling) and/or from intermediate generations (Fig. 2, intermediate sampling). With intermediate sampling, the samples can be obtained either directly from the progenitors of each generation (progenitor design, Fig. 2) or else from siblings of those progenitors (sibling design, Fig. 2). The sibling design is feasible in plants as seeds from early generations can be stored and grown out later. This is advantageous because plant material from all generations can be sampled simultaneously under identical conditions. An advantage of the progenitor design is that the methylation status of the pedigree founder is known, and that the methylation status of individual cytosines can be traced back in time through the entire pedigree. However, progenitor samples are taken in real-time (i.e. at selected generations) and thus growth conditions may vary over the experiment and introduce unwanted sources of noise. In the analysis of trees, DNA methylation samples come from contemporary leafs, as the methylomes of earlier developmental time-points are inaccessible (at least not easily accessible). Endpoint sampling is therefore the most obvious sampling strategy.

Irregardless of sampling design, AlphaBeta interprets the underlying pedigree as a sparse directed network. The nodes of the network correspond to ‘individuals’ whose methylomes have been sampled (i.e. type S* nodes), or of the common ancestors of these individuals, whose methylomes have typically not been sampled (i.e. type S nodes) (Fig.2).

f2 <- system.file("extdata/vg", "Figure2.pdf",  package = "AlphaBeta")
knitr::include_graphics(f2)

\section{Input files}

Pedigree files of MA lineages

To be able to re-construct the topology of the underlying pedigree, AlphaBeta requires two types of input files: "nodeslist.fn" and "edgelist.fn". The structure of these files follows the standard file format required by the R network package igraph.

#Load "nodelist.fn" file
sampleFile <- system.file("extdata/vg", "nodelist.fn",  package = "AlphaBeta")

"nodelist.fn" has the following structure

file <- fread(sampleFile)
head(file,10)

\setlength{\leftskip}{1cm}

filename: Lists the filenames corresponding to nodes (S and S ). Filenames of type S nodes should be identical with the names of their corresponding methylome files (see below). Type S nodes lacking corresponding methylome measurements, and should be designated by (-).

node: An arbitrary but unique name given to each node.

gen: Specifies the generation time of the nodes in the underlying pedigree.

meth: Indicates if methylome measurements are available for a given node (Y = yes, N = no).

\setlength{\leftskip}{0pt}

\pagebreak

The "edgelist.fn" file specifies the ancestral (i.e. lineages) relationship between nodes.

#Load "edgelist.fn" file
edgesFile <- system.file("extdata/vg", "edgelist.fn",  package = "AlphaBeta")

"edgelist.fn" has the following structure

edges <- fread(edgesFile)
head (edges, 10)

\setlength{\leftskip}{1cm}

from and to: Specifies the network edges, which are any direct connections between type S and S* nodes in the pedigree. Only unique pairs of nodes need to be supplied (Fig.1). These 2 columns are mandatory.

gendiff (optional): Specifies the number of generations that separate the two nodes. This column is useful only for plotting purpose and it can be omitted for epimutation rate estimation. However, we recommend that this column be supplied because it is useful for accurately scaling the edge lengths when plotting certain pedigrees with progenitor.endpoint and sibling design (see 4.1).

group (optional): Along with "gendiff" column, groupings supplied in this column will help in scaling the edge lengths when plotting the pedigree.

\setlength{\leftskip}{0pt}

Pedigree files of Trees

#Load "SOMA_nodelist.fn" file
treeSamples <- system.file("extdata/soma", "SOMA_nodelist2.fn", package="AlphaBeta")

"SOMA_nodelist.fn" has the following structure

fileTree <- fread(treeSamples)
head(fileTree, 10)

\setlength{\leftskip}{1cm}

filename: Lists the filenames corresponding to nodes S. Type S nodes lacking corresponding methylome measurements, and should be designated by (-).

node: An arbitrary but unique name given to each node.

Branchpoint_date: Specifies the branchpoint time of the nodes in the underlying pedigree.

meth: Indicates if methylome measurements are available for a given node (Y = yes, N = no).

\setlength{\leftskip}{0pt}

\pagebreak

The "SOMA_edgelist.fn" file specifies the ancestral (i.e. lineages) relationship between nodes.

treeEdges <- system.file("extdata/soma", "SOMA_edgelist2.fn", package="AlphaBeta")

"SOMA_edgelist.fn" has the following structure

fileEdges <- fread(treeEdges)
head(fileEdges, 10)

\setlength{\leftskip}{1cm}

from and to: Specifies the network edges, which are any direct connections between type nodes in the pedigree. Only unique pairs of nodes need to be supplied. These 2 columns are mandatory.

stem (optional): To be provided only for trees with 2 or more stems (as.in our example). This column should be left blank for a tree with a single stem.

\setlength{\leftskip}{0pt}

Methylome files

Type S* nodes in the pedigree have corresponding methylome data. In its current implementation, AlphaBeta expects methylome files that have been produced by \textit{methimpute} Methimpute package. \textit{methimpute} is a HMM-based methylation state caller for whole-genome bisulphite sequencing (WGBS) data. It can produce cytosine-level methylation state as well as region-level methylation methylation state calls. The former calls are required to obtain cytosine-level epimutation rates, while the latter calls are required to obtain region-level epimuation rates. Methylome files from alternative callers and/or measurement technologies are possible but should be converted to the \textit{methimpute} file structure (see below).

Cytosine-level calls

"cytosine-level methylome files" have the following structure

f3 <- system.file("extdata/vg", "methylome-sample.png",  package = "AlphaBeta")
knitr::include_graphics(f3)

\setlength{\leftskip}{1cm}

seqnames, start and strand: Chromosome coordinates

context: Sequence context of cytosine i.e CG,CHG,CHH

counts.methylated: Counts for methylated reads at each position

counts.total: Counts for total reads at each position

posteriorMax: Posterior value of the methylation state call

status : Methylation status

rc.meth.lvl: Recalibrated methylation level calculated from the posteriors and fitted parameters

context.trinucleotide: Trinucleotide context of the cytosine context

\setlength{\leftskip}{0pt}

Region-level calls

"region-level methylome files" have the following structure

f4 <- system.file("extdata/vg", "methylome-sample-region.png",  package = "AlphaBeta")
knitr::include_graphics(f4)

\setlength{\leftskip}{1cm}

seqnames, start and strand: Chromosome coordinates

posteriorMax: Posterior value of the methylation state call

status : Methylation status

rc.meth.lvl: Recalibrated methylation level calculated from the posteriors and fitted parameters

context: Sequence context of cytosine i.e CG,CHG,CHH

\setlength{\leftskip}{0pt}

Tips for converting files from alternative callers and/or technologies

Methylome files generated by alternative callers and/or measurement technologies should be converted to meet the \textit{methimpute} file structure. We have the following tentative recommendations.

NGS-based technologies: For NGS-based technologies including whole-genome bisulphite sequencing (WGBS), reduced presentation bisulphite sequencing (RRBS) and epigenotyping by sequencing (epiGBS), columns seqnames, start, strand, context, counts.methylated, counts.toal and status are typically available as they are part of standard outputs of BS-seq aligment software, such as Bismark and BSseeker. In this case, we recommend removing all rows where counts.total = 0, and set posteriorMax = 1, rc.meth.lvl = counts.methylated/counts.total, context.trinucleotide = NA.

Array-based technologies: Although we have not directly tested this, it should also be possible to convert methylome data from array-based technologies, including MeDIP-chip, to the methylome file structure required by AlphaBeta. In this case, we recommend to set seqnames = probe chromosome position, start = probe start position, strand = arbitrarily fix to (+), context = arbitrarily fix to context "CG", counts.methylated = NA, count.total = NA, posteriorMax = 1, rc.meth.lvl = array methylation signal, context.trinucleotide = NA.

\section{Building pedigree} The function "buildPedigree" builds the pedigree from the input files. Divergence time (delta.t) is calculated as follows: delta.t = t1 + t2 - 2*t0, where t1 is the time of sample 1 (in generations), t2 is the time of sample 2 (in generations) and t0 is the time (in generations) of the most recent common founder of samples 1 and 2.

Building MA-lines Pedigree

output <- buildPedigree(nodelist = sampleFile, 
                        edgelist = edgesFile,
                        cytosine = "CG",
                        posteriorMaxFilter = 0.99)
rFile <- system.file("extdata/dm","output.rds", package="AlphaBeta")
output <- readRDS(rFile )

Divergence values (D.value):

head(output$Pdata)     

\setlength{\leftskip}{1cm}

time 1: Generation time of sample \textit{i}

time 2: Generation time of sample \textit{j}

time 0: Generation time of most recent common ancestor of samples \textit{i} and \textit{j}

D.value: Mean absolute divergence in DNA methylation states between samples \textit{i} and \textit{j}

\setlength{\leftskip}{0pt}

Building Tree Pedigree

outputTree <- buildPedigree(nodelist = treeSamples, 
                        edgelist = treeEdges,
                        cytosine = "CG",
                        posteriorMaxFilter = 0.99)
TrFile <- system.file("extdata/soma","outputSoma.rds", package="AlphaBeta")
outputTree <- readRDS(TrFile)

Divergence values (D.value):

head(outputTree$Pdata)     

\section{Diagnostic plots}

The correct specification of the pedigree topology and the removal of influential outlier data points are critical aspects for epimutation rate estimation. Visual inspection of the pedigree is provided through the function "plotPedigree".

Plotting pedigrees

Pedigree of MA-lines

## Progenitor-endpoint design
plotPedigree(nodelist=sampleFile, 
               edgelist=edgesFile,
               sampling.design="progenitor.endpoint",
               output.dir=out.dir,
               plot.width=5, 
               plot.height=5,
               aspect.ratio=1,
               vertex.size=6,
               vertex.label=FALSE,
               out.pdf="MA1_1")

## Sibling design
plotPedigree(nodelist=system.file("extdata/vg","nodelist_MA2_3.fn", package="AlphaBeta"), 
               edgelist=system.file("extdata/vg","edgelist_MA2_3.fn", package="AlphaBeta"),
               sampling.design="sibling",
               output.dir=out.dir,
               plot.width=5, 
               plot.height=5,
               aspect.ratio=2.5,
               vertex.size=12,
               vertex.label=FALSE,
               out.pdf="MA2_3")

## Progenitor-intermediate design
plotPedigree(nodelist=system.file("extdata/vg","nodelist_MA3.fn", package="AlphaBeta"),
               edgelist=system.file("extdata/vg","edgelist_MA3.fn", package="AlphaBeta"),
               sampling.design="progenitor.intermediate",
               output.dir=out.dir,
               plot.width=5,
               plot.height=8,
               aspect.ratio=2.5,
               vertex.size=13,
               vertex.label=FALSE,
               out.pdf="MA3")
f5 <- system.file("extdata/vg", "MAlines.png",  package = "AlphaBeta")
knitr::include_graphics(f5)

\pagebreak

Tree pedigrees

plotPedigree(nodelist=treeSamples, 
               edgelist=treeEdges,
               sampling.design="tree",
               output.dir=out.dir,
               plot.width=5, 
               plot.height=5,
               aspect.ratio=1,
               vertex.size=8,
               vertex.label=FALSE,
               out.pdf="Tree")

\setlength{\leftskip}{1cm}

nodelist: file containing list of nodes

edgelist: files containing list of edges

sampling.design: set sampling design according to pedigree

output.dir: set output directory path

plot.width, plot.height: set width and height of the output pdf

aspect.ratio, vertex.size: for adjusting the ratio of height to width of the pedigree plot

vertex.label: vertex labels can be printed with either TRUE or FALSE

out.pdf: set NULL to print plot on screen or set name to output as pdf

\setlength{\leftskip}{0pt}

f6 <- system.file("extdata/vg", "Trees.png",  package = "AlphaBeta")
knitr::include_graphics(f6)

Plotting divergence time (delta.t) versus methylome divergence (D.value)

This is an interactive plot for inspecting the divergence data and removing outlier samples (if any):

pedigree<- output$Pdata
dt <- pedigree[,2] + pedigree[,3] - 2 * pedigree[,1]
plot(dt, pedigree[,"D.value"], ylab="Divergence value", xlab=expression(paste(Delta, " t")))

\section{Epimutation rate estimation in selfing-systems}

Models ABneutral, ABselectMM, ABselectUU and ABnull can be used to estimate the rate of spontaneous epimutations in selfing-derived MA-lines. The models are currently restricted to diploids.

Run Models

Run Model with no selection (ABneutral)

ABneutral fits a neutral epimutation model. The model assumes that epimutation accumulation is under no selective constraint. Returned are estimates of the methylation gain and loss rates and the proportion of epi-heterozygote loci in the pedigree founder genome.

Initial proportions of unmethylated cytosines:

p0uu_in <- output$tmpp0
p0uu_in
pedigree <- output$Pdata
# output directory
output.data.dir <- paste0(getwd()) 

output <- ABneutral(
  pedigree.data = pedigree,
  p0uu = p0uu_in,
  eqp = p0uu_in,
  eqp.weight = 1,
  Nstarts = 2,
  out.dir = output.data.dir,
  out.name = "ABneutral_CG_global_estimates"
)

NOTE: it is recommended to use at least 50 Nstarts to achieve best solutions

Showing summary output of only output:

summary(output)
head(output$pedigree)

Plot estimates of ABneutral model:

ABfile <- system.file("extdata/models/", 
                      "ABneutral_CG_global_estimates.Rdata", 
                      package = "AlphaBeta")
#In 'ABplot' function you can set parameters to customize the pdf output. 
ABplot(pedigree.names=ABfile, output.dir=getwd(), out.name="ABneutral", plot.height=8, plot.width=11)

\pagebreak

f7 <- system.file("extdata/vg", "ABneutral_MA1_1.pdf",  package = "AlphaBeta")
knitr::include_graphics(f7)

Run model with selection against spontaneous gain of methylation (ABselectMM)

ABselectMM fits an epimutation model with selection. The model assumes that epimutation accumulation is in part shaped by selection against spontaenous losses of cytosine methylation. Returned are estimates of the methylation gain and loss rates, the selection coeficient, and the proportion of epi-heterozygote loci in the pedigree founder genome.

outputABselectMM <- ABselectMM(
  pedigree.data = pedigree,
  p0uu = p0uu_in,
  eqp = p0uu_in,
  eqp.weight = 1,
  Nstarts = 2,
  out.dir = output.data.dir,
  out.name = "ABselectMM_CG_global_estimates"
)

Run model with selection against spontaneous loss of methylation (ABselectUU)

ABselectUU fits an epimutation model with selection. The model assumes that epimutation accumulation is in part shaped by selection against spontaenous gains of cytosine methylation. Returned are estimates of the methylation gain and loss rates, the selection coeficient, and the proportion of epi-heterozygote loci in the pedigree founder genome.

outputABselectUU <- ABselectUU(
  pedigree.data = pedigree,
  p0uu = p0uu_in,
  eqp = p0uu_in,
  eqp.weight = 1,
  Nstarts = 2,
  out.dir = output.data.dir,
  out.name = "ABselectUU_CG_global_estimates"
)

Run model that considers no accumulation of epimutations (ABnull)

ABnull fits a model of no epimutation accumulation. This model serves as the Null model.

outputABnull <- ABnull(
  pedigree.data = pedigree,
  out.dir = output.data.dir,
  out.name = "ABnull_CG_global_estimates"
  )

Comparison of different models and selection of best model

Testing ABneutral vs. ABnull

file1 <-
  system.file("extdata/models/",
              "ABneutral_CG_global_estimates.Rdata",
              package = "AlphaBeta")
file2 <-
  system.file("extdata/models/",
              "ABnull_CG_global_estimates.Rdata",
              package = "AlphaBeta")

out <- FtestRSS(pedigree.select = file1,
                pedigree.null = file2)

out$Ftest

Testing ABselectMM vs.ABneutral

file1 <-
  system.file("extdata/models/",
              "ABselectMM_CG_global_estimates.Rdata",
              package = "AlphaBeta")
file2 <-
  system.file("extdata/models/",
              "ABnull_CG_global_estimates.Rdata",
              package = "AlphaBeta")

out <- FtestRSS(pedigree.select = file1,
                pedigree.null = file2)

out$Ftest

Testing ABselectUU vs.ABneutral

file1 <-
  system.file("extdata/models/",
              "ABselectUU_CG_global_estimates.Rdata",
              package = "AlphaBeta")
file2 <-
  system.file("extdata/models/",
              "ABnull_CG_global_estimates.Rdata",
              package = "AlphaBeta")

out <- FtestRSS(pedigree.select = file1,
                pedigree.null = file2)

out$Ftest

Bootstrap analysis with the best fitting model(BOOTmodel)

i.e ABneutral in our case

NOTE: it is recommended to use at least 50 Nboot to achieve best solutions

inputModel <- system.file("extdata/models/",
              "ABneutral_CG_global_estimates.Rdata",
              package = "AlphaBeta")

# Bootstrapping models CG

Boutput <- BOOTmodel(
  pedigree.data = inputModel,
  Nboot = 2,
  out.dir = getwd(),
  out.name = "ABneutral_Boot_CG_global_estimates"
)

summary(Boutput)
Boutput$standard.errors

\section{Epimutation rate estimation in clonal, asexual and somatic systems}

Models ABneutralSOMA, ABselectMMSOMA and ABselectUUSOMA can be used to estimate the rate of spontaneous epimutations from pedigree-based high-throughput DNA methylation data. The models are generally designed for pedigree data arising from clonally or asexually propagated diploid species. The models can also be applied to long-lived perrenials, such as trees, using leaf methylomes and coring data as input. In this case, the tree branching structure is treated as an intra-organismal pedigree (or phylogeny) of somatic lineages.

Run Models

Run Model with no selection (ABneutralSOMA)

This model assumes that somatically heritable gains and losses in cytosine methylation are selectively neutral.

Initial proportions of unmethylated cytosines:

Tree_p0uu_in <- outputTree$tmpp0
Tree_p0uu_in
pedigree.Tree <- outputTree$Pdata
outputABneutralSOMA <- ABneutralSOMA(
  pedigree.data = pedigree.Tree,
  p0uu = Tree_p0uu_in,
  eqp = Tree_p0uu_in,
  eqp.weight = 0.001,
  Nstarts = 2,
  out.dir = getwd(),
  out.name = "ABneutralSOMA_CG_global_estimates"
)
ABfilesoma <- system.file("extdata/models/", 
                      "ABneutralSOMA_CG_global_estimates.Rdata", 
                      package = "AlphaBeta")
outputABneutralSOMA <- dget(ABfilesoma)
summary(outputABneutralSOMA)
head(outputABneutralSOMA$pedigree)

Plot estimates of ABneutralSOMA model:

ABfilesoma <- system.file("extdata/models/", 
                      "ABneutralSOMA_CG_global_estimates.Rdata", 
                      package = "AlphaBeta")
ABplot(pedigree.names=ABfilesoma, output.dir=getwd(), out.name="ABneutralSOMA", plot.height=8, plot.width=11)
f8 <- system.file("extdata/vg", "ABneutralSOMA.pdf",  package = "AlphaBeta")
knitr::include_graphics(f8)

Run model with selection against spontaneous gain of methylation (ABselectMMSOMA)

This model assumes that somatically heritable losses of cytosine methylation are under negative selection. The selection parameter is estimated.

outputABselectMMSOMA <- ABselectMMSOMA(
  pedigree.data = pedigree.Tree,
  p0uu = Tree_p0uu_in,
  eqp = Tree_p0uu_in,
  eqp.weight = 0.001,
  Nstarts = 2,
  out.dir = getwd(),
  out.name = "ABselectMMSOMA_CG_global_estimates"
)

Run model with selection against spontaneous loss of methylation (ABselectUUSOMA)

This model assumes that somatically heritable gains of cytosine methylation are under negative selection. The selection parameter is estimated.

outputABselectUUSOMA <- ABselectUUSOMA(
  pedigree.data = pedigree.Tree,
  p0uu = Tree_p0uu_in,
  eqp = Tree_p0uu_in,
  eqp.weight = 0.001,
  Nstarts = 2,
  out.dir = getwd(),
  out.name = "ABselectUUSOMA_CG_global_estimates"
)

Bootstrap analysis with the best fitting model (BOOTmodel)

inputModelSOMA <- system.file("extdata/models",
              "ABneutralSOMA_CG_global_estimates.Rdata",
              package = "AlphaBeta")

# Bootstrapping models CG

Boutput <- BOOTmodel(
  pedigree.data = inputModelSOMA,
  Nboot = 2,
  out.dir = getwd(),
  out.name = "ABneutral_Boot_CG_global_estimates"
)

summary(Boutput)
Boutput$standard.errors

\section{R session info }

sessionInfo()


Try the AlphaBeta package in your browser

Any scripts or data that you put into this service are public.

AlphaBeta documentation built on Nov. 8, 2020, 6:30 p.m.