knitr::opts_chunk$set(echo = TRUE, comment = "")
Current efforts in systems genetics have focused on the development of statistical approaches that aim to disentangle causal relationships among molecular phenotypes in segregating populations. Model selection criterions, such as the AIC and BIC, have been widely used for this purpose, in spite of being unable to quantify the uncertainty associated with the model selection call. In this tutorial we illustrate the use of software implementing the causal model selection hypothesis tests proposed by Chaibub Neto et al. (2012).
This tutorial illustrates the basic functionality of the CMST routines in the qtlhot
R package using few simulated toy examples. The analysis of a yeast genetical genomics data-set presented in Chaibub Neto et al. (2012) is reproduced in a separate package, R/qtlyeast
. The R/qtlhot
package depends on R/qtl
(Broman et al. 2003), and we assume the reader is familiar with it.
Here, we illustrate the basic functionality of the CMST routines in the R/qtlhot
package in a toy simulated example.
library(CausalMST)
This package was developed with quantitative trait loci in mind. However, its use is not limited to such experiments, which is why we are now spawning a separate package. The first example uses made up data with a simple model shown below.
Consider two outcomes $y_1, y_2$ with residual variances 0.4 for $y_1$ and 0.1 for the other phenotype. We set up the causal relations $y_1 = z + \epsilon_1$ and $y_2 = 0.5 \, y_1 + \epsilon_2$.
n_ind <- 1000 z <- sample(0:1, n_ind, replace = TRUE, prob = rep(0.5,2)) y1.z <- z + sqrt(0.4) * rnorm(n_ind) y2.y1 <- 0.5 * y1.z + sqrt(0.1) * rnorm(n_ind)
The four models to be compared are $M_1$ ($z \rightarrow y_1 \rightarrow y_2$ as in the above example), $M_2$ ($z \rightarrow y_2 \rightarrow y_1$), $M_3$ ($y_1 \leftarrow z \rightarrow y2$), and $M_4$ as in Chaibub Neto et al. (2012) and returns concise tables along with summary statistics. Recall that $M_4$ is similar to $M_3$ allows extra dependence between y1 and y2 (with the 3 submodels of causal direction or association among $y_1$ and $y_2$ being likelihood equivalent).
dat <- data.frame(z = z, y1 = y1.z, y2 = y2.y1)
library(ggplot2) grid::grid.newpage() grid::pushViewport( grid::viewport( layout = grid::grid.layout(ncol = 2))) print(ggplot(dat, aes(x=y1.z, y=y2.y1, col=factor(z))) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se = FALSE), vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 1)) print(ggplot(dat, aes(y=y1.z, x=y2.y1, col=factor(z))) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se = FALSE), vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 2))
(result <- cmst(driver = as.matrix(dat$z), outcomes = dat[-1]))
The result
shows that the correct model $M_1$ ($y_2\leftarrow y_1 \leftarrow z$) has the smallest p-value among the four models. The nonpar
tests do the best here since the others assume normality whereas, for instance, the outcome $y_1$ is bimodal due to the influence of the qualitative driver $z$.
The second exampleis built from an R/qtl cross object for illustration with the geno` as driver.
We first use the SimCrossCausal
function to simulate a cross
object with 3 phenotypes, $y_1$, $y_2$ and $y_3$, where $y_1$ has a causal effect on both $y_2$ and $y_3$. The simulated cross data set, Cross
, is composed of: 100 individuals (n.ind = 100}); 3 chromosomes of length 100cM (
len = rep(100, 3)); 101 unequally spaced markers per chromosome (
n.mar = 101and
eq.spacing = FALSE); additive genetic effect set to 1 (
add.eff = 1); dominance genetic effect set to 0 (
dom.eff = 0}); residual variances for $y_1$ (sig2.1
) and the other phenotypes (sig2.2
) set to 0.4 and 0.1, respectively; backcross cross type (cross.type = "bc"
); and phenotype data transformed to normal scores (normalize = TRUE
). The argument `beta = rep(0.5, 2)}, represents the causal effect of $y_1$ on the other phenotypes (i.e., coefficients of the regressions of $y_2 = 0.5 \, y_1 + \epsilon$ and $y_3 = 0.5 \, y_1 + \epsilon$). The length of beta controls the number of phenotypes to be simulated.
library(qtl)
set.seed(987654321) CMSTCross <- SimCrossCausal(n.ind = 100, len = rep(100, 3), n.mar = 101, beta = rep(0.5, 2), add.eff = 1, dom.eff = 0, sig2.1 = 0.4, sig2.2 = 0.1, eq.spacing = FALSE, cross.type = "bc", normalize = TRUE)
We compute the genotype conditional probabilities using Haldane's map function, genotype error rate of 0.0001, and setting the maximum distance between positions at which genotype probabilities were calculated to 1cM.
CMSTCross <- qtl::calc.genoprob(CMSTCross, step = 1) nms <- names(CMSTCross$pheno)
The helper routine qtl2cmst
will convert QTL data into a driver, outcomes and covariates.
We at the moment take advantage of knowing were the QTL lies.
Here, we perform the causal model selection tests for phenotypes $y_1$ and $y_2$ using the qtl2cmst
and cmst
functions. There is a wrapper for these called CMSTtests
(not shown). The Q.chr
and Q.pos
arguments specify the chromosome and position (in cM) of the QTL to be used as a causal anchor. The argument method
specify which version of the CMST test should be used. The options "par"
, "non.par"
and "joint"
represent, respectively, the parametric, non-parametric, joint parametric versions of the CMST test. The option "all"
fits all three versions. The penalty
argument specifies whether we should test statistics based on the AIC ("aic"
), BIC ("bic"
), or both ("both"
) penalties. In this particular call we computed all 3 versions using both penalties fitting 6 separate CMST tests.
setup <- qtl2cmst(CMSTCross, pheno1 = nms[1], pheno2 = nms[2], Q.chr = 1, Q.pos = 55)
Here is the CMST performed on the converted data.
cmst(setup$driver, setup$outcomes, setup$covariates, setup$addcov, setup$intcov, method = "all", penalty = "both")
We perform QTL mapping using Haley-Knott regression (Haley and Knott 1992), and summarize the results for the 3 phenotypes. Figure \ref{lod.profiles} presents the LOD score profiles for all 3 phenotypes. The black, blue and red curves represent the LOD profiles of phenotypes $y_1$, $y_2$ and $y_3$, respectively.
Scan <- qtl::scanone(CMSTCross, pheno.col = 1 : 3, method = "hk") summary(Scan, thr = 3, format = "allpheno")
plot(Scan, lodcolumn = 1 : 3, ylab = "LOD")
LOD score profiles for phenotypes $y_1$ (black curve), $y_2$ (blue curve) and $y_3$ (red curve).
Phenotypes $y_1$ and $y_2$ map to exactly the same QTL at position r round(summary(Scan, 3, "allpheno")[1,"pos"], 1)
cM on chromosome 1. Phenotype $y_3$ maps to a QTL at position r round(summary(Scan, 3, "allpheno")[2,"pos"], 1)
cM. Whenever two phenotypes map to nearby, but not exactly identical, positions we are faced with the question of which QTL to use as causal anchor. Instead of making a (sometimes) arbitrary choice, our approach is to compute the joint LOD profile of both phenotypes and use the QTL detected by this joint mapping approach as the causal anchor. This is pursued in package R/qtlhot
.
The function CMSTtests
also computes CMST tests of a single phenotype against a list of phenotypes. Its output is less detailed though. In this particular call we test $y_1$ against $y_2$ and $y_3$.
setup <- qtl2cmst(CMSTCross, pheno1 = nms[1], pheno2 = nms[2:3], Q.chr = 1, Q.pos = 55)
Here is the CMST performed on the converted data.
cmst(setup$driver, setup$outcomes, setup$covariates, setup$addcov, setup$intcov, method = "all", penalty = "both")
There are several other functions involved in simulation and in data analysis that are not well documented yet.
See R/qtlhot
from CRAN and R/qtlyeast
available at https://github.com/byandell/qtlyeast for further analysis.
Broman K., H. Wu, S. Sen, G. A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics {19}: 889-890.
Chaibub Neto E, Broman AT, Keller MP, Attie AD, Zhang B, Zhu J, Yandell BS (2013) Modeling causality for pairs of phenotypes in system genetics. Genetics 193 : 1033-1013.
Churchill G. A., R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics {138}: 963-971.
Schadt E. E., J. Lamb, X. Yang, J. Zhu, S. Edwards, et al., 2005 An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics {37}: 710-717.
Zhu J., B. Zhang, E. N. Smith, B. Drees, R. B. Brem, L. Kruglyak, R. E. Bumgarner, E. E. Schadt, 2008 Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics {40}: 854-861.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.