A standard probablistic phylogenetic model for discrete traits describes a species phylogeny with observed trait values at the tips of the tree and unobserved trait values at the ancestral nodes. In this framework the turnover rate ($\gamma$) describes the rate at which "mutation" events happen on the tree, and the allelic stationary distribution ($\pi$) describes the prior on allelic states $s$ at the root of the tree. phyloGLM modifies the standard phylogenetic model by parameterizing $\gamma$ and $\pi$ as functions of a set of genomic covariates. Furthermore, phyloGLM can accommodate probabilistic allele values at the tips of the tree by taking conditional probabilities of the raw data given the allelic state ($P(x|s)$). This can be highly useful in the presence of poor cross species alignments or potential error in calling the trait value.

Installation

Currently, phyloGLM is not available on CRAN so the only way to install it is from the GitHub repo. To install it directly you need to have devtools installed then run devtools::install_github("ndukler/phyloGLM"). You can also clone the phyloGLM repo and install from source if desired. This package is still being actively developed so reccomendations on ways to improve documentation and bug reports are very welcome.

Constructing your first rate model

We need several elements to construct a rate model. The first is a tree, which, in this case, we read in from newick format using read.tree from the ape package.

library(ggplot2)
library(cowplot)
library(scales)
library(reshape2)
library(data.table)
library(ape)
library(phyloGLM)

## Read in a sample tree
tree=read.tree(system.file("extdata", "sampleTree.nh", package="phyloGLM"))
plot(tree)

Next we need set of covariates. Note that we also included a site id as the first column, this will be useful in a moment.

## Read in covarites associated with each site 
siteLabels=fread(system.file("extdata", "siteCovariates.txt", package="phyloGLM"))
head(siteLabels)

Last we need to read in the allele data. It's easiest to do this when the allele data is stored in text file where the first two columns are site and species respectively, and all other columns corresond the the conditional allele probabilities as shown below:

writeLines(readLines(system.file("extdata", "samplePAF.txt", package="phyloGLM"),3))

If in this format, it can be read into the format required to construct and alleleData object using the readPAF() function as follows. The subset argument guarentees that only the sites of interest are read in and matches their order in the subset vector. We can use value of the siteLabels$site to ensure the row order of the allele data matches the covariates.

## Read in allele data
aData=readPAF(system.file("extdata", "samplePAF.txt", package="phyloGLM"),subset = siteLabels$site)

Now create the allele data object:

ad <- alleleData(data = aData$dataList, tree = tree, siteInfo = siteLabels)

Now create a rate model containing the data from the alleleData object, and providing the linear formulas for the rate and pi link functions (note that if no piFormula is supplied it defaults to using the same one as the rate formula).

## Set the rate parameter formula
rateFormula <- formula(~ cre.class + logXpress + 0)
## Set rate bounds
rBounds <- c(10^-2, 10)
## Build rateModel
rateMod <- rateModel(data = ad, rateFormula = rateFormula, rateBounds = rBounds)

phyloGLM is multithreaded

Note that the likelihood calculations in this package scale well when parallelized up to one or two less than the number of available threads on your system. The threads argument can also be used with several other functions, including the fit function to speed up model fitting.

m <- microbenchmark::microbenchmark(
  ll(model = rateMod, threads = 1),
  ll(model = rateMod, threads = 2),
  ll(model = rateMod, threads = 3)
)
m <- as.data.frame(m)
ggplot(m, aes(x = factor(as.numeric(m$expr)), y = time / 10^6)) +
  geom_boxplot() +
  ylab("Time for log-likelihood evaluation (ms)") +
  xlab("Threads") +
  ylim(0, max(m$time / 10^6))+
  theme_cowplot()

Fitting and analysing a rateModel

Now we can fit the model:

fitted_status <- fit(model = rateMod, threads = 3,method = "bfgs", hessian = TRUE)

Once fitted we can extract the parameters, we can use the estimated hessiantheir approximate standard errors and plot them. The rate parameters are fairly straightforward to interper, coefficients which are greater than zero are associated with an increased turnover rate, while coefficients less than zero are associated with a decreased turnover rate. The allelic parameters are more difficult to interpert, but essentially for a coefficient $\beta_{ij}>0$ where $i$ is the allele and $j$ is the feature, feature $j$ is positively correlated with an increased frequency of allele $i$ relative to the base allele.

seTab <- se(rateMod, fitted_status$hessian)
param_plots <- plotParams(seTab)
plot_grid(param_plots[[1]] + theme(axis.text.x = element_text(angle = 45, hjust = 1)),
  param_plots[[2]] + theme(axis.text.x = element_text(angle = 45, hjust = 1)),
  align = "h"
)

Because of the way the rate is parameterized, the exact values of the rate coefficients depend heavily on the upper and lower bounds for the rate (although the relative impacts are still interpertable). Thus it can be valuable to view the marginal distribution of rate estimates with respect each covariate. If we want to see the distribution of rates for specific covariate values we can compute the rates at desired sites for all edgeGroups and plot them. From this we can see the behavior of the rates with respect to the covariates on the data.

## Compute the rate distribution
rDist <- rates(rateMod)
## Extract the site information
si <- getSiteInfo(rateMod)
si[, site := 1:nrow(.SD)]
## Merge the rates and the site information
rMerged <- merge(si, rDist, by = "site")
## Plot rate dsitribution, split by enhancers and promoters
g1 <- ggplot(rMerged, aes(x = cre.class, y = rate, fill = factor(edgeGroup))) +
  geom_boxplot(notch = TRUE) +
  theme_cowplot()+
  theme(legend.position = "bottom")
g2 <- ggplot(rMerged, aes(x = logXpress, y = rate, fill = factor(edgeGroup))) +
  geom_point() +
  theme_cowplot()+
  theme(legend.position = "bottom")
plot_grid(g1, g2, align = "h")

We can also compute the number of turnover events, and plot them on a tree if we want.

## Plot the expected number of transitions on the edges of the branches
plotElementTransitions(rateMod)
## A seperate function for getting a table of expected transitions.
## Use if the actual numbers of expected transitions are desired.
## phyloGLM::marginalTransitions(rateMod,aggregate = "edge")

To get explicit p-values for specific parameters we can use the likelihood ratio test (since estimating the standard error via the hessian may not be reliable). To do that we must fit two models, one with (alternate model), and one without (null model) the parameter of interest (in this case the alternate model is the model we previously fit). Note how we are careful to set the formulas for the rate and stationary distributions seperately to ensure that the models differ by only one parameter. The models must also be nested in order for this test to be valid. A small p-value rejects the null model, concluding that the more complex model fits the data better, even given the additional degrees of freedom.

nullFormula <- formula(~ cre.class + 0)
alternateFormula <- formula(~ cre.class + logXpress + 0)
rateModNull = copy(rateMod)
forceParameterZero(rateModNull,index = 2)
fitted_status_alternate <- fit(rateModNull, threads = 3,control = list(trace = 0))
lrt_expression <- lrt(h0 = rateModNull, hA = rateMod)
print(lrt_expression)

Models may also be compared usng the baysian information criterion (BIC) which does not require that the models be nested. The lower the BIC the better the model, in other words if BIC(alternate) < BIC(null), we can conclude that the alternate model provides a better fit to the data. In this case we perform BIC(alternate)-BIC(NULL), see a large negative value, and conclude that the alternate model is better.

bic(rateMod) - bic(rateModNull)

Saving a rateModel for later exploration

Lastly you can save any model using a combination of the pack() and saveRDS() functions. The model can then be recreated with the unpack command and you're good to go!

pMod <- pack(rateMod)
## Saving and reloading the packed model (can use compression as well)
# saveRDS(object = pMod,file = "~/model.RData.gz",compress = "gzip")
# pMod=readRDS(file = "~/model.RData.gz")
rateMod2 <- unpack(pMod)


ndukler/phyloGLM documentation built on Sept. 25, 2019, 9:17 p.m.