In byandell/CausalMST: Causal Model Selection Hypothesis Tests

knitr::opts_chunk$set(echo = TRUE, comment = "")

Motivation

Current efforts in systems genetics have focused on the development of statistical approaches that aim to disentangle causal relationships among molecular phenotypes in segregating populations. Model selection criterions, such as the AIC and BIC, have been widely used for this purpose, in spite of being unable to quantify the uncertainty associated with the model selection call. In this tutorial we illustrate the use of software implementing the causal model selection hypothesis tests proposed by Chaibub Neto et al. (2012).

Overview

This tutorial illustrates the basic functionality of the CMST routines in the qtlhot R package using few simulated toy examples. The analysis of a yeast genetical genomics data-set presented in Chaibub Neto et al. (2012) is reproduced in a separate package, R/qtlyeast. The R/qtlhot package depends on R/qtl (Broman et al. 2003), and we assume the reader is familiar with it.

Basic functionality

Here, we illustrate the basic functionality of the CMST routines in the R/qtlhot package in a toy simulated example.

library(CausalMST)

This package was developed with quantitative trait loci in mind. However, its use is not limited to such experiments, which is why we are now spawning a separate package. The first example uses made up data with a simple model shown below.

CMST with small example

Consider two outcomes $y_1, y_2$ with residual variances 0.4 for $y_1$ and 0.1 for the other phenotype. We set up the causal relations $y_1 = z + \epsilon_1$ and $y_2 = 0.5 \, y_1 + \epsilon_2$.

n_ind <- 1000
z <- sample(0:1, n_ind, replace = TRUE, prob = rep(0.5,2))
y1.z <- z + sqrt(0.4) * rnorm(n_ind)
y2.y1 <- 0.5 * y1.z + sqrt(0.1) * rnorm(n_ind)

The four models to be compared are $M_1$ ($z \rightarrow y_1 \rightarrow y_2$ as in the above example), $M_2$ ($z \rightarrow y_2 \rightarrow y_1$), $M_3$ ($y_1 \leftarrow z \rightarrow y2$), and $M_4$ as in Chaibub Neto et al. (2012) and returns concise tables along with summary statistics. Recall that $M_4$ is similar to $M_3$ allows extra dependence between y1 and y2 (with the 3 submodels of causal direction or association among $y_1$ and $y_2$ being likelihood equivalent).

dat <- data.frame(z = z,
                   y1 = y1.z, 
                   y2 = y2.y1)

library(ggplot2)
grid::grid.newpage()
grid::pushViewport(
  grid::viewport(
    layout = grid::grid.layout(ncol = 2)))

print(ggplot(dat, aes(x=y1.z, y=y2.y1, col=factor(z))) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE),
  vp = grid::viewport(layout.pos.row = 1,
                      layout.pos.col = 1))
print(ggplot(dat, aes(y=y1.z, x=y2.y1, col=factor(z))) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE),
  vp = grid::viewport(layout.pos.row = 1,
                      layout.pos.col = 2))

(result <- cmst(driver = as.matrix(dat$z), outcomes = dat[-1]))

The result shows that the correct model $M_1$ ($y_2\leftarrow y_1 \leftarrow z$) has the smallest p-value among the four models. The nonpar tests do the best here since the others assume normality whereas, for instance, the outcome $y_1$ is bimodal due to the influence of the qualitative driver $z$.

CMST using QTL data

The second exampleis built from an R/qtl cross object for illustration with the geno` as driver.

We first use the SimCrossCausal function to simulate a cross object with 3 phenotypes, $y_1$, $y_2$ and $y_3$, where $y_1$ has a causal effect on both $y_2$ and $y_3$. The simulated cross data set, Cross, is composed of: 100 individuals (n.ind = 100}); 3 chromosomes of length 100cM (len = rep(100, 3)); 101 unequally spaced markers per chromosome (n.mar = 101andeq.spacing = FALSE); additive genetic effect set to 1 (add.eff = 1); dominance genetic effect set to 0 (dom.eff = 0}); residual variances for $y_1$ (sig2.1) and the other phenotypes (sig2.2) set to 0.4 and 0.1, respectively; backcross cross type (cross.type = "bc"); and phenotype data transformed to normal scores (normalize = TRUE). The argument `beta = rep(0.5, 2)}, represents the causal effect of $y_1$ on the other phenotypes (i.e., coefficients of the regressions of $y_2 = 0.5 \, y_1 + \epsilon$ and $y_3 = 0.5 \, y_1 + \epsilon$). The length of beta controls the number of phenotypes to be simulated.

library(qtl)

set.seed(987654321)
CMSTCross <- SimCrossCausal(n.ind = 100,
                        len = rep(100, 3),
                        n.mar = 101,
                        beta = rep(0.5, 2),
                        add.eff = 1,
                        dom.eff = 0,
                        sig2.1 = 0.4,
                        sig2.2 = 0.1,
                        eq.spacing = FALSE,
                        cross.type = "bc",
                        normalize = TRUE)

We compute the genotype conditional probabilities using Haldane's map function, genotype error rate of 0.0001, and setting the maximum distance between positions at which genotype probabilities were calculated to 1cM.

CMSTCross <- qtl::calc.genoprob(CMSTCross, step = 1)
nms <- names(CMSTCross$pheno)

Direct use of cmst

The helper routine qtl2cmst will convert QTL data into a driver, outcomes and covariates. We at the moment take advantage of knowing were the QTL lies.

Here, we perform the causal model selection tests for phenotypes $y_1$ and $y_2$ using the qtl2cmst and cmst functions. There is a wrapper for these called CMSTtests (not shown). The Q.chr and Q.pos arguments specify the chromosome and position (in cM) of the QTL to be used as a causal anchor. The argument method specify which version of the CMST test should be used. The options "par", "non.par" and "joint" represent, respectively, the parametric, non-parametric, joint parametric versions of the CMST test. The option "all" fits all three versions. The penalty argument specifies whether we should test statistics based on the AIC ("aic"), BIC ("bic"), or both ("both") penalties. In this particular call we computed all 3 versions using both penalties fitting 6 separate CMST tests.

setup <- qtl2cmst(CMSTCross,
                  pheno1 = nms[1],
                  pheno2 = nms[2],
                  Q.chr = 1,
                  Q.pos = 55)

Here is the CMST performed on the converted data.

cmst(setup$driver, setup$outcomes, 
       setup$covariates, 
       setup$addcov, setup$intcov, 
       method = "all", penalty = "both")

QTL mapping and CMST tests

We perform QTL mapping using Haley-Knott regression (Haley and Knott 1992), and summarize the results for the 3 phenotypes. Figure \ref{lod.profiles} presents the LOD score profiles for all 3 phenotypes. The black, blue and red curves represent the LOD profiles of phenotypes $y_1$, $y_2$ and $y_3$, respectively.

Scan <- qtl::scanone(CMSTCross, pheno.col = 1 : 3, method = "hk")
summary(Scan, thr = 3, format = "allpheno")

plot(Scan, lodcolumn = 1 : 3, ylab = "LOD")

LOD score profiles for phenotypes $y_1$ (black curve), $y_2$ (blue curve) and $y_3$ (red curve).

Phenotypes $y_1$ and $y_2$ map to exactly the same QTL at position r round(summary(Scan, 3, "allpheno")[1,"pos"], 1) cM on chromosome 1. Phenotype $y_3$ maps to a QTL at position r round(summary(Scan, 3, "allpheno")[2,"pos"], 1) cM. Whenever two phenotypes map to nearby, but not exactly identical, positions we are faced with the question of which QTL to use as causal anchor. Instead of making a (sometimes) arbitrary choice, our approach is to compute the joint LOD profile of both phenotypes and use the QTL detected by this joint mapping approach as the causal anchor. This is pursued in package R/qtlhot.

Multiple phenotypes

The function CMSTtests also computes CMST tests of a single phenotype against a list of phenotypes. Its output is less detailed though. In this particular call we test $y_1$ against $y_2$ and $y_3$.

setup <- qtl2cmst(CMSTCross,
                  pheno1 = nms[1],
                  pheno2 = nms[2:3],
                  Q.chr = 1,
                  Q.pos = 55)

Here is the CMST performed on the converted data.

cmst(setup$driver, setup$outcomes, 
       setup$covariates, 
       setup$addcov, setup$intcov, 
       method = "all", penalty = "both")

Other Functions

There are several other functions involved in simulation and in data analysis that are not well documented yet. See R/qtlhot from CRAN and R/qtlyeast available at https://github.com/byandell/qtlyeast for further analysis.

References

Broman K., H. Wu, S. Sen, G. A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics {19}: 889-890.
Chaibub Neto E, Broman AT, Keller MP, Attie AD, Zhang B, Zhu J, Yandell BS (2013) Modeling causality for pairs of phenotypes in system genetics. Genetics 193 : 1033-1013.
Churchill G. A., R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics {138}: 963-971.
Schadt E. E., J. Lamb, X. Yang, J. Zhu, S. Edwards, et al., 2005 An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics {37}: 710-717.
Zhu J., B. Zhang, E. N. Smith, B. Drees, R. B. Brem, L. Kruglyak, R. E. Bumgarner, E. E. Schadt, 2008 Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics {40}: 854-861.