knitr::opts_chunk$set(message = FALSE,warning = FALSE)
library(ggdag) library(pander) library(tidyverse) library(data.tree)
This package implements the Bayesian Modelling for Exon Arrays (BMEA) approach as a simple package, with MCMC sampling performed in C and genes analysed in parallel. Currently, all chains are initialised within the same computational node.
The BMEA model can be applied to any whole transcript array for which a CDF exists, in which an exon-level structure is defined. Under this model, an observed probe intensity for a given gene $PM_{hijk}$ is assumed to be an additive sum of background signal ($B_{hijk}$) and 'true' signal ($S_{hijk}$) such that
$$ PM_{hijk} = B_{hijk} + S_{hijk} $$
where the treatment group ($h$), sample ($i$), exon ($j$) and probe ($k$) are all denoted by the appropriate subscript.
The background component is assumed to be log-normally distributed $\log B_{hijk} \sim \mathcal{N}(\lambda_{hijk}, \delta_{hijk})$, where the values $\lambda_{hijk}$ and $\delta_{hijk}$ are dependent on probe sequence and are estimated in advance of gene-level analysis.
The signal component is also assumed to be log-normally distributed $\log S_{hijk} \sim \mathcal{N}(\eta_{hijk}, \sigma_S)$ where $\sigma_S$ is the general variance term for all probes within a given gene. The mean ($\eta_{hijk}$) is the sum of the overall expression level $c_{hi}$, probe-level terms ($p_k$) and an exon-level term $\log \phi_{hj}$, where $\phi_{hj}$ is the proportion of transcripts containing exon $j$ in treatment group $h$.
$$ \eta_{hijk} = c_{hi} + p_k + \log \phi_{hj} $$
This equation is based on the model
$$ S_{hijk} = \phi_{hj} e^{c_{hi} + p_k} $$
where $\phi_{hj} = 1$ for constitutive exons, and $\phi_{hj} = 0$ for completely absent exons.
dagify(PM ~ B + S, B ~ l, B ~ d, S ~ e, S ~ s, e ~ c, e ~ p, e ~ phi, p ~ sp, c ~ mu, c ~ sm, labels = c(PM = "PM[hijk]", B = "B[hijk]", S = "S[hijk]", l = "lambda[il]", d = "delta[il]", e = "eta[hijk]", s = "sigma[S]", c = "c[hi]", p = "p[k]", phi = "phi[hj]", sp = "sigma[p]", mu = "mu[h]", sm = "sigma[mu]") ) %>% tidy_dagitty() %>% mutate(x = case_when( name == "PM" ~ 0, name == "B" ~ -2, name == "S" ~ 2, name == "e" ~ -0.5, name == "c" ~ -2.5, name == "p" ~ -1, name == "phi" ~ 0.5, name == "mu" ~ -3.5, name == "sm" ~ -2.5, name == "sp" ~ -1, name == "s" ~ 2, name %in% c("d", "l") ~ -3.5 ), xend = case_when( to == "PM" ~ 0, to == "B" ~ -2, to == "S" ~ 2, to == "e" ~ -0.5, to == "c" ~ -2.5, to == "p" ~ -1, to == "phi" ~ 0.5, to == "mu" ~ -3.5, to == "sm" ~ -2.5, to == "sp" ~ -1, to == "s" ~ 2, to %in% c("d", "l") ~ -3.5 ), y = case_when( name == "PM" ~ 1, name %in% c("B", "S") ~ 2, name == "e" ~ 3, name %in% c("c", "p", "phi") ~ 4, name %in% c("mu", "sm", "sp", "s") ~ 5.5, name == "d" ~ 1.8, name == "l" ~ 2.7 ), yend = case_when( to == "PM" ~ 1, to %in% c("B", "S") ~ 2, to == "e" ~ 3, to %in% c("c", "p", "phi") ~ 4, to %in% c("mu", "sm", "sp", "s") ~ 5.5, to == "d" ~ 1.8, to == "l" ~ 2.7 ), shape = name %in% c("d", "l"), label = paste0("italic(", label, ")")) %>% ggplot(aes(x, y, xend = xend, yend= yend)) + geom_dag_edges() + geom_dag_node(aes(shape = shape), colour = "white", internal_colour = "black") + geom_dag_text(aes(label = label), parse = TRUE, colour = "black", size = 5, family = "Times") + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), data = data_frame(ymin = 0.5, ymax = 4.8, xmin = -3, xmax = 2.5), fill = rgb(1,1,1, 0), colour = "black", inherit.aes = FALSE) + geom_text(aes(x, y, label = label), data = data_frame( label = c("italic(PM[hijk]) == italic(B[hijk] + S[hijk])", "log(italic(S[hijk])) %~%~N(italic(eta[hijk],sigma[S]))", "log(italic(B[hijk])) %~%~N(italic(lambda[il],delta[il]))", "italic(eta[hijk] == c[hi] + p[k] + log(phi[hj]))", "italic(phi[hj] %~%~U(0,1))", "italic(p[k] %~%~N(0, sigma[p]))", "italic(c[hi] %~%~N(mu[h], sigma[mu]))", "italic(sigma[S] %prop% sigma[S]^-1)", "italic(sigma[p] %~%~U(0, 10))", "italic(sigma[mu] %~%~U(0, 5))", "italic(mu[h] %~%~U(0, 2^16))") ) %>% mutate(x = 2.8, y = seq(0.5, by = 0.52, length.out = nrow(.))), inherit.aes = FALSE, family = "Times", parse = TRUE, size = 5, hjust = 0, colour = "black") + scale_shape_manual(values = c(21, 22)) + scale_x_continuous(limits = c(-4, 5)) + scale_y_continuous(limits = c(0.5, 7)) + guides(shape = FALSE) + theme_void()
The modified MAT model [@Kapur2007] is first fitted on the set of background probes to obtain estimates of all MAT model parameters, and fitted values for each BG probe. BG probes are then divided in $l = 1, 2, ..., L$ approximately equal bins (usually $L = 20$) based on fitted values. Means and standard deviations from each bin for each array are then used as bin-specific values $\lambda_{il}$ and $\delta_{il}$ respectively.
The model is then applied to the probe sequences of all $PM$ probes to obtain a fitted value for each probe, and these are assigned to the appropriate bin, with values $\lambda_{il}$ and $\delta_{il}$ specified as the hyperparameters for the BG signal prior distributions.
BMEA
relies on the same package infrastructure as aroma.affymetrix
and requires files to be in the same locations as specified by this package.
The required layout is as follows
path <- c( "parentDirectory/annotationData/chipTypes/chipType/chipType.cdf", "parentDirectory/annotationData/chipTypes/chipType/chipType_bgProbes.bgp", "parentDirectory/annotationData/chipTypes/chipType/chipType.probe_tab", "parentDirectory/rawData/exptName/chipType/File1.CEL", "parentDirectory/rawData/exptName/chipType/File2.CEL" ) data.tree::as.Node(data.frame(pathString = path))
For analysis of data using HuEx-1.0-st-v2
arrays, using both the Affymetrix CDF and a custom CDF, for an experiment called myExpt
this may look like.
Here we'll be using the antigenomic background probes for estimation of parameters.
path <- c( "parentDirectory/annotationData/chipTypes/HuEx-1_0-st-v2/HuEx-1_0-st-v2.cdf", "parentDirectory/annotationData/chipTypes/HuEx-1_0-st-v2/HuEx-1_0-st-v2.r2.antigenomic.bgp", "parentDirectory/annotationData/chipTypes/HuEx-1_0-st-v2/HuEx-1_0-st-v2,Custom.cdf", "parentDirectory/annotationData/chipTypes/HuEx-1_0-st-v2/HuEx-1_0-st-v2,Custom_probe_tab", paste0("parentDirectory/rawData/myExpt/HuEx-1_0-st-v2/File", 1:8, ".CEL") ) data.tree::as.Node(data.frame(pathString = path))
Once we have the data in the correct directory structure, we can load the package and define the required data objects.
library(BMEA) library(limma) library(snow) library(magrittr) library(tidyverse)
Now we can define the two CDF files required for the analysis
knitr::opts_chunk$set(eval = FALSE)
chipType <- "HuEx-1_0-st-v2" affyCdf <- AffymetrixCdfFile$byChipType(chipType) myCdf <- AffymetrixCdfFile$byChipType(chipType, tags="Custom")
For our experiment myExpt
we now define the experiment name, and organise the CEL files into an AffymetrixCelSet
using the aroma.affmetrix
infrastructure.
exptName <- "myExpt" cs <- AffymetrixCelSet$byName(exptName, cdf=affyCdf)
Background correction is not required for BMEA as it is included in the model-fitting stage. Instead, the first required step will be to quantise normalise the data. First we define the process, then we perform the normalisation. This may take a little while. Note also, that this will quantile normalise the entire set of probes on the array, treating background and $PM$ probes equally.
qn <- QuantileNormalization(cs) csN <- process(qn, verbose=verbose)
Now we have normalised the data, we can fit the MAT background model on our given set of background probes. As these probes may be missing from the custom CDF, we'll stay with the Affymetrix CDF for this step.
bgParam <- fitBackgroundParameters(csN, cdf=affyCdf, bgProbes="r2.antigenomic.bgp", method="MAT")
MAT model parameters will be fitted using all arrays, and fitted values will be in the list element bgParam$fitted
.
Model coefficients will be in bgParam$coef
whilst observed log intensities for each probe will be in bgParam$observed
After fitting the model, we can estimate values for the prior for each bin, on each array.
bgBins <- defineMatBins(bgParam)
We can check these values across the entire experiment
c("lambda", "delta") %>% lapply(function(x){ set_rownames(bgBins[[x]], paste0("Bin", 1:nrow(bgBins[[x]]))) %>% as.data.frame(stringsAsFactors = FALSE) %>% rownames_to_column("Bin") %>% gather("Array", "Estimate", -Bin) %>% mutate(Parameter = paste0("hat(", x, ")[il]"), Bin = factor(Bin, levels = paste0("Bin", 1:nrow(.)))) }) %>% bind_rows() %>% as_tibble() %>% ggplot(aes(Bin, Estimate)) + geom_boxplot() + facet_wrap(~Parameter, scales = "free", ncol = 1, labeller = label_parsed) + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Now we can define a set of CEL files which contain the values $\lambda_{hijk}$ and $\delta_{hijk}$ based on which bin $l = 1, 2, ..., L$ that each probe belongs to. Before doing this, first we should set the CDF for the dataset as our custom CDF.
setCdf(csN, myCdf)
Now we can use the probe sequences contained in our custom CDF to assign them to the appropriate bin for expected background signal.
pmSequenceFile <- "HuEx-1_0-st-v2,Custom.probe_tab" bgCelSet <- assignBgPriors(csN, seqFile=pmSequenceFile, bgBins=bgBins, bgParam=bgParam, overWrite=TRUE)
First we need to define the treatment groups and the contrast matrix. Here we'll just assign half of our samples to each group.
n <- length(csN) conditions <- rep(c("A", "B"), each = floor((n+1)/2))[seq_len(n)] conditions <- factor(conditions, levels = c("A", "B")) contrastMatrix <- makeContrasts(BVsA = B - A, levels = conditions)
Now we set our MCMC parameters.
mcmcParam <- list(nChains = 3L, nIter = 12000L, nBurnin = 6000L, nThin = 6L)
Each gene will now be fit using r mcmcParam$nChains
independent chains, for r mcmcParam$nIter
iterations, discarding the first r mcmcParam$nBurnin
as the burnin period.
Only one value in every r mcmcParam$nThin
values will be saved during the MCMC process.
We can also set the process to run on any specific units, which must be specified numerically.
units <- data_frame(unitID = getUnitNames(myCdf), unit = seq_along(unitID))
The process is implemented so it will run on parallel nodes defined by the package snow
, so let's define 6 nodes, which can be run on a standard desktop with 4 multi-threaded cores.
Please note that this process can run for several days.
Testing 18 arrays on a workstation with 20 threads will complete in 24-36 hours.
In the following, we'll fit 20 genes on each node for a total of 60 genes.
nCores <- 3 fitUnits <- units$unit[seq_len(nCores*20)] cl <- makeCluster(nCores, type="SOCK") bmeaFit <- fitBmea.Snow(celSet = csN, bgCelSet = bgCelSet, cl = cl, units = fitUnits, batchSize = 10, conditions = conditions, contMatrix = contrastMatrix, paramToWrite = c("c", "mu", "phi"), mcmcParam = mcmcParam) stopCluster(cl)
The above process will record the output from each node in a separate folder, which will then need to be merged into a single set of CEL files. The saved parameters will have the 2.5, 25, 50, 75, and 97.5th percentiles of the posterior distributions saved in separate CEL files, along with the posterior mean and sd.
mergeNodes(celSet=csN, as.list(names(bmeaFit$units)), paramToWrite=c("c", "mu","phi"))
If you're happy with the merged results, you can easily delete the files from each individual node.
clearNodes(names(bmeaFit$units))
In order to inspect the results, we can define an AffymetrixCelSetList
for our parameter of interest.
Let's start with logFC
which under BMEA will be defined as $\Delta \mu = \mu_B - \mu_A$ in keeping with the defined contrast matrix and the DAG in Figure \@ref(fig:DAG).
csLogFC <- AffymetrixCelSetList(csN, type="contrast", tags="logFC")
Once we have defined an AffymetrixCelSetList
, we can use extractBmeaArray()
to extract the values we need as an array.
In the following, we'll extract the values required for a 95\% Central Posterior Interval, and a simple $B$ statistic which can be used for ranking results.
Note that in the following, we're also converting logFC to the $\log_2$ scale.
bmeaLogFC <- extractBmeaArray(csLogFC)[, c("2.5%", "mean", "97.5%", "B"),"BVsA"] %>% as.data.frame() %>% rownames_to_column("unitID") %>% as_data_frame() %>% mutate(mean = mean / log(2), `2.5%` = `2.5%` / log(2), `97.5%` = `97.5%` / log(2)) %>% filter(is.finite(mean)) %>% # Get rid of genes not fitted arrange(desc(abs(B)))
From here you can simply inspect the list and decide which genes are DE given any filtering criteria you choose to apply.
In order to detect any potential alternate splicing events, we can also check the posterior distribution for $\Delta \log \phi$
csPhiLogFC <- AffymetrixCelSetList(csN, type="contrast", tags="phiLogFC")
Instead of checking all genes, the most highly expressed quartile was shown to be the most reliable for generating candidate AS events, as for these genes, the true signal far exceeds the background signal on the array itself. This leads to more high-confidence candidates. Here we'll just load in the posterior mean as a point estimate.
csMu <- AffymetrixCelSetList(csN, type="model", tags="mu") mu <- extractBmeaArray(csMu, units = fitUnits)[, "mean", c("A", "B")] %>% as.data.frame() %>% rownames_to_column("unitID") %>% as_tibble() %>% gather("Array", "Mu", -unitID)
Now we can find the top quartile.
muTopQ <- mu %>% filter(Mu > 0) %>% group_by(unitID) %>% summarise(Mu_mean = mean(Mu)) %>% filter(Mu_mean > quantile(Mu_mean, probs = 0.75)) %>% left_join(units)
In addition, genes with minimal logFC
are easier to assess via laboratory methods, and you like to filter your data to exclude genes with any detected fold-change.
logFCLowQ <- filter(bmeaLogFC, abs(mean) < quantile(abs(mean), probs = 0.25))
Now we can find the intersection of highly-expressed genes, with those for which there is minimal evidence of fold-change.
candidateUnits <- intersect(logFCLowQ$unitID, muTopQ$unitID)
We can extract these units and inspect them for candidate exons.
phiLogFC <- extractBmeaArray(csPhiLogFC, units = candidateUnits, firstOnly = FALSE) %>% magrittr::extract(, c("2.5%", "50%", "97.5%", "B"),"BVsA") %>% as.data.frame() %>% rownames_to_column("groupID") %>% as_data_frame() %>% rename(median = `50%`) %>% mutate(median = median / log(2), `2.5%` = `2.5%` / log(2), `97.5%` = `97.5%` / log(2))
phiLogFC %>% separate(groupID, into = c("unitID", "groupID"), sep = "\\.") %>% arrange(desc(abs(B)))
In order to restrict results to the most high confidence set of candidates, choosing those with a highest $Z$-score as used during the DABG stages is also advisable. A viable approach would be to choose those with a $Z$-score in the highest quartile. In the following, we'll obtain $Z$-scores for the fitted units only
fitUgc <- getUnitGroupCellMap(myCdf, units = fitUnits, retNames = TRUE) fitPM <- getIntensities(csN, indices = fitUgc$cell) %>% set_colnames(celNames) %>% as.data.frame() %>% split(f = fitUgc$unit) %>% parallel::mclapply(as.matrix, mc.cores = 3) fitLambda <- getIntensities(bgCelSet$lambda, indices = fitUgc$cell) %>% set_colnames(celNames) %>% log() %>% as.data.frame() %>% split(f = fitUgc$unit) %>% parallel::mclapply(as.matrix, mc.cores = 3) fitDelta <- getIntensities(bgCelSet$delta, indices = fitUgc$cell) %>% set_colnames(celNames) %>% log() %>% as.data.frame() %>% split(f = fitUgc$unit) %>% parallel::mclapply(as.matrix, mc.cores = 3) fitZ <- names(fitPM) %>% parallel::mclapply(function(x){ zScore(fitPM[[x]], fitLambda[[x]], fitDelta[[x]], exons = droplevels(filter(fitUgc, unit ==x))$group) }, mc.cores = 3) %>% set_names(names(fitPM)) exonZ <- fitZ %>% lapply(function(x){x$exon}) %>% unlist %>% as.data.frame() %>% set_names("Z") %>% rownames_to_column("unitID") %>% as_data_frame() %>% mutate(groupID = gsub(".+\\.(ENSG[0-9_]+)", "\\1", unitID), unitID = gsub("(ENSG[0-9]+)\\..+", "\\1", unitID)) %>% dplyr::select(unitID, groupID, Z)
A suitable value for inclusion of an exon-level group may be $Z_j > 50$, and a 95% CPI which excludes the range $\pm \kappa$ for some real value $\kappa$, e.g. $\kappa = 0.2$
phiLogFC %>% separate(groupID, into = c("unitID", "groupID"), sep = "\\.") %>% left_join(exonZ) %>% arrange(desc(abs(B)))
If wanting to fit only a single unit, this is also possible by simply running the command fitBmeaSingle()
.
This will fit on a single node, calling the underlying C
code, but without writing the values to disk.
unit <- 1 myFit <- fitBmeaSingle(csN, bgCelSet, unit, conditions, contrastMatrix, mcmcParam = mcmcParam)
To inspect the fold change at the gene level
myFit$logFC %>% as.data.frame() %>% rownames_to_column("Comparison") %>% ggplot(aes(`50%`, 1)) + geom_point() + geom_errorbarh(aes(xmin = `2.5%`, xmax = `97.5%`)) + geom_vline(xintercept = 0, colour = "blue", linetype = 2) + labs(x = "95% CPI", y = c()) + facet_wrap(~Comparison) + theme_bw() + theme(axis.text.y = element_blank(), axis.title.y = element_blank())
Similarly at the exon level
myFit$phiLogFC$BVsA %>% as.data.frame() %>% rownames_to_column("groupID") %>% dplyr::select(groupID, `2.5%`, median = `50%`, `97.5%`) %>% left_join(exonZ) %>% mutate(groupID = str_extract(groupID, "_[0-9]+"), candidate = Z > 50) %>% ggplot(aes(median, groupID, colour = candidate))+ geom_point() + geom_errorbarh(aes(xmin = `2.5%`, xmax = `97.5%`)) + geom_vline(xintercept = 0, colour = "red", linetype = 2) + scale_colour_manual(values = c("grey50", "green")) + labs(x = "95% CPI", colour = "Z > 50") + theme_bw()
pander(sessionInfo())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.