\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}

Statistical models for super-enhancers

Activity model is $A(\vect{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$.

Linear model is $R(\vect{x}) = e^{A(\vect{x})} + \epsilon$.

Exponential model is $R(\vect x) = e^{A(\vect x)} + \epsilon$.

Logistic model is $R(\vect x)=1/\left(1+e^{-A(\vect{x})}\right)+\epsilon$.

Fitting your first model

To start fitting statistical models to your enhancer data you need to format your data in a particular way. First you need to place information about which enhancers are present in each experiment in a data.frame where each row is an experiment and each column is an enhancer element. Here's a simple example using the data from Shin et al. (2015):

library(superEnhancerModelR)
data("wap")
design = wap[,c(3:5)]
head(design)

Then we need a vector of normalized expression values, with each element of the vector corresponding to one row of the previous data frame.

## Create "expression data""
expression = wap[,2]
print(head(expression))

Once we have both expression values and a design matrix, we can construct an enhancerDataObject. This object holds the raw data, the design of each experiment, and detail of the statistical model that you want to fit. You can specify the link function: additive, exponential, or logistic (from Dukler et al., 2017). You can also specify an error model: Gaussian, or log-normal. Upper and lower bounds for all parameters can also be specified although we will leave them at the defaults for now. We specify a model with a logistic link function, log-normal error, and non-interacting enhancer loci as follows:

## Create activity function
actFun = formula(~E1+E2+E3)
independent.loci = enhancerDataObject(expression,design,actFun,errorModel="lognormal",linkFunction="logistic")

This model can then be fit using a differential evolutionary algorithm and refined with gradient descent (note that this step could be run with more iterations.):

independent.loci=optimDE(independent.loci,maxit=2000,refine=TRUE,threads = 6,control = list(trace=500))

Once you have a fitted model you can plot the data and the predicted model. All plot functions return ggplot objects, so if you save them you can manipulate them in exactly the same way.

iplot = plotModel(independent.loci)
plot(iplot)

You can also view the model residuals

plotResiduals(independent.loci)

Testing for enhancer interactions

Say we want to test whether E1 and E2 interact. Then we augment the independent model with an interaction term (E1:E2), and fit it. We can view the fitted interaction model in the same way we viewed the independent loci model. In the context of formulas,the "*" operator produces all combinations of variables. e.g E1*E2*E3 produces E1+E2+E3+E1:E2+E1:E3+E2:E3+E1:E2:E3, which it typically too many parameters to fit at once without overfitting. (Note: You can also use the "|" operator which expresses the conditional relationship. e.g. E1|E2 is equivalent to E1|E2==TRUE)

actFun2 = formula(~E1+E2+E1:E2)
dependent.loci = enhancerDataObject(expression,design,actFun2,errorModel="lognormal",linkFunction="logistic")
dependent.loci=optimDE(dependent.loci,maxit=2000,refine=TRUE,threads = 6,control = list(trace=FALSE))
plotModel(dependent.loci)

Then we can compare their respective Bayesian information criterion (BIC). Lower BIC indicates a better model.

paste("The BIC for the interacting loci model is:",bic(dependent.loci))
paste("The BIC for the independent loci model is:",bic(independent.loci))

The significance of a difference in Bayes factor can be interpreted using this table.

bic.tab=data.frame(delta_bic=c("0 to 2","2 to 6","6 to 10",">10"),Evidence=c("Not worth more than a bare mention","Positive","Strong","Very Strong"))
colnames(bic.tab)=c("$\\Delta BIC$","Evidence against higher BIC")
knitr::kable(bic.tab, booktabs = TRUE, escape = FALSE)

Deriving coefficient specific p-values

If you want p-values for individual coefficients, run the function computeParamPvals and it will calculate the significance of the each parameter via the likelihood-ratio test. Note that since the model will need to be re-trained for each parameter removed, it is a good idea to set the number of training iterations to be the same as the number used to train the original model.

param.stats=computeParamPvals(independent.loci,maxit=2000,threads = 6,control = list(trace=FALSE))
print(param.stats$pVals)

Looking at orientation specific effects

If you're measuring the effects of enhancers in a reporter assay, you may be interested to check if the orientation of the insert has an effect on expression. Create simulated data where orientation has no effect. In order to do that, include an orientation column, which has a 0 for one orientation and a 1 for the other. The orientation column can be thought of as encoding an activity difference from the "base" orientation.Here we construct an artificial example where there is no effect of orientation and fit it:

set.seed(123)
design=data.frame(expression=c(rnorm(n = 10,mean = 100,sd=10),rnorm(n = 5,mean = 10,sd=1)),
              E1=c(rep(1,10),rep(0,5)),orientation=c(rep(c(0,1),each=5),rep(0,5)))
actFun=formula(~E1+E1:orientation)
orientation.effects = enhancerDataObject(expressionData = design[,1],designInfo = 
                                           design[,-1],actFun,errorModel="gaussian",linkFunction="exponential")
orientation.effects=optimDE(orientation.effects,maxit=2000,refine=TRUE,control=list(trace=FALSE))

Then we can compute p-values for each term, if the orientation interaction term is significant, there are orientation effects.

orientation.param.stats=computeParamPvals(orientation.effects,maxit=2000,threads = 6,control=list(trace=FALSE))
orientation.param.stats$pVals


ndukler/superEnhancerModelR documentation built on May 17, 2019, 8:18 p.m.