Here we present an update of the R package for the Combined Analysis of Epistasis
and Pleiotropy, or cape
. This package implements a method, originally described
in @Carter:2012fd, that infers directed interaction networks between genetic
variants for predicting the influence of genetic perturbations on quantitative
traits. This method takes advantage of complementary information in partially
pleiotropic genetic variants to resolve directional influences between variants
that interact epistatically. cape
can be applied to a variety of genetic variants,
such as single nucleotide polymorphisms (SNPs), copy number variations (CNVs)
or structural variations (SVs). Here we demonstrate the functionality of cape
by inferring a predictive network between quantitative trait loci (QTL) in a
cross between the non-obese, non-diabetic (NON) mouse and the New Zealand obese
(NZO) mouse [@Reifsnyder:2000is].
To install cape use:
install.packages("cape")
After installation load cape with the following command:
set.seed(1234) library(cape)
The current version of cape recognizes data in multiple formats: R/qtl, R/qtl2, and PLINK.
Data for the demonstrations below can be found in the demo folder CAPE github site: https://github.com/TheJacksonLaboratory/cape
Use the green "Code" button in the upper right of the page to download the repository and put it where you keep code or projects or other things of this nature. The top level of the folder has a .here file in it from the here package. This will anchor any R session opened in this folder to the top level and path names can be constructed using just folder names. For example, to create a path that points to the parameter file in the demo_PLINK folder, use the following command:
param_file_path <- here("demo", "demo_PLINK", "0_plink.parameters_0.yml")
Open R and set the working directory to cape_master after downloading
the code linked above. This will allow the relative paths constructed
with here()
in the demo code to find the data and parameter files it
needs.
Also make sure the package here is installed and loaded.
install.packages("here") library("here")
library("here")
In addition to supporting the newer R/qtl2 (@broman2019r) format, we also support
the R/qtl (@rqtl) csv format for cape. You can learn more about this format
at the R/qtl website. Briefly, this format
contains all traits and genotypes in one comma-delimited file.
The first few columns contain traits, covariates, and individual identifiers,
and the remaining contain genotypes. Please see the qtl demo data included
with cape for an example. Also see ?read_population
for more details.
To read in data with this format, we use read_population
, followed
by cape2mpp
, which updates the object to the newer cape format.
The two final objects, obesity_cross
and obesity_geno
hold all
the data that cape
needs to run. obesity_cross
holds phenotype
and metadata, and obesity_geno
holds genotype data. These can be
use in the pheno_obj
and geno_obj
arguments in the run_cape()
function, which we will use later on.
data_file <- here("demo", "demo_qtl", "data", "NON_NZO_Reifsnyder_pgm_CAPE_num.csv") param_file <- here("demo", "demo_qtl", "0_NON_NZO.parameters_0.yml") cross <- read_population(data_file) cross_obj <- cape2mpp(cross) obesity_cross <- cross_obj$data_obj obesity_geno <- cross_obj$geno_obj$geno
Many researchers are now using the updated R/qtl2 (@broman2019r), and we also provide ways to handle data in this format. If you are not familiar with qtl2, we highly recommend reading about it and testing it out. R/qtl2 is the definitive genetic mapping package, and the outstanding documentation can answer many basic questions about genetic mapping.
The following code shows how to access remote example data provided by Karl Broman
for the qtl2 package. For both accessing remote data, or local data, we use
the read_cross2
function from qtl2 to read in the data, and then qtl2_to_cape
to convert the qtl2 object to a cape object.
iron_qtl2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/iron/iron.yaml") iron_cape <- qtl2_to_cape(cross = iron_qtl2) data_obj <- iron_cape$data_obj geno_obj <- iron_cape$geno_obj
Completely new to this version of cape is the ability to load files formatted for PLINK (@purcell2007plink). This enables broader use of cape by researchers who are interested in studying genetic interactions in human cohorts.
We have provided example data with the cape demo_plink. Please see the PLINK website for more information about PLINK and formatting data for PLINK.
For cape, we require a ped file, a map file, and a phenotype file. PLINK standard format includes only one trait, and cape requires at least two traits. We therefore need the phenotype file in the pheno.txt format as described here.
To generate the data for the demo included in cape, we downloaded example PLINK hapmap data.
To reduce the size of the data set for demonstration purposes, we sampled 1000 the SNPs in the data set using PLINK:
./plink --file hapmap1 --make-bed --thin-count 1000 --recode --out test
We generated two random, normally distributed phenotypes for the phenotype file. These are not particularly interesting, but they do demonstrate what the files should look like.
The following code shows how to read in PLINK data, use the function
plink2cape
to read in the map, ped, and phenotype files.
This code is also included in demo_PLINK.R, which is included with cape.
data_path <- here("demo", "demo_PLINK", "data") ped <- file.path(data_path, "test.ped") map <- file.path(data_path, "test.map") pheno <- file.path(data_path, "test.pheno") out <- file.path(data_path, "test.csv") param_file <- here("demo", "demo_PLINK", "0_plink.parameters_0.yml") cross_obj <- plink2cape(ped, map, pheno, out = "out.csv") data_obj <- cross_obj$data_obj geno_obj <- cross_obj$geno_obj$geno
In this vignette, we will reanalyze a data set described in @Reifsnyder:2000is. This experiment was performed to identify quantitative trait loci (QTL) for obesity and other risk factors of type II diabetes in a reciprocal back-cross of non-obese non-diabetic NON/Lt mice and the diabetes-prone, New Zealand obese (NZO/HILt) mice. The study found multiple main-effect QTL influencing phenotypes associated with diabetes and obesity as well as multiple epistatic interactions. In addition, maternal environment (i.e. whether the mother was obese) was found to interact with several markers and epistatic pairs to influence the risk of obesity and diabetes of the offspring. The complex nature of diabetes and obesity, along with their complex and polygenic inheritance patterns, make this data set ideal for an analysis of epistasis and pleiotropy.
Included in this dataset are 204 male mice genotyped at 85 MIT markers across the genome. The phenotypes included are the body weight (g), insulin levels (ng/mL), and the log of plasma glucose levels (mg/dL), all measured at age 24 weeks. In addition, there is a variable called ``pgm'' indicating whether the mother of each mouse was normal weight (0) or obese (1).
These data can be accessed through the demo_qtl demonstration code, which was loaded above.
Before proceeding with an analysis we recommend exploring the data visually. cape provides a few basic functions for looking at distributions of traits and the correlations between traits.
The figure below shows distributions of three traits: body weight at 24 weeks
(BW_24), serum insulin levels (INS_24), and the log of serum glucose levels
(log_GLU_24). We have selected these traits out of all traits included in
the data set by using the pheno_which
argument. Leaving this argument
as NULL includes all traits in the plot.
hist_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
Body weight looks relatively normally distributed, but glucose and insulin have obviously non-normal distributions. We can see this in a different way by looking at the Qnorm plots for each trait using `qnorm_pheno.'
qnorm_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
In general we recommend mean centering and normalizing all traits before
proceeding with the analysis. Trait normalization can be achieved through log
transformation, quantile normalization, or another method before the analysis.
The function norm_pheno
uses quantile normalization to fit the
phenotypes to a normal distribution. Briefly, this process ranks the trait
values and divides by n-1. It then returns the quantiles of the normal
distribution using qnorm.
Mean centering subtracts the mean value
from each trait, and standardizing divides by the standard deviation.
obesity_cross <- norm_pheno(obesity_cross, mean_center = TRUE)
Now when we plot the distributions compared to a theoretical normal distribution, we see that our traits are much closer to normally distributed. Insulin still has a ceiling effect, which cannot be removed by normalization because rank cannot be determined among equal values.
qnorm_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
Cape relies on the selection of two or more traits that have common genetic factors but are not identical across all individuals. One indirect way at getting at this is through trait correlation. Traits that are modestly correlated may have both common and divergent contributing genetic factors. This is in contrast to traits that are perfectly correlated and likely have only common genetic influences, and to traits that are completely uncorrelated and likely have divergent genetic influences.
We can look at the correlation of the traits in our data set using plot_pheno_cor
The figure below shows the correlations between all pairs of traits. Here,
we have also colored the points by the covariate "pgm" to look for some
initial trends. You can see that mice with obese mothers tended to have
higher body weight, insulin levels, and glucose levels.
The lower triangle of panels in the figure below shows the point clouds
of each pair of traits plotted against each other. The diagonal shows
the distribution of each individual trait, and the upper triangle
gives numeric information about pairwise correlations. If color_by
is not specified, only the overall pearson R values are shown for each
trait pair. If color_by
is specified, we show the overall correlation
as well as correlations for each group in color_by.
plot_pheno_cor(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"), color_by = "pgm", group_labels = c("Non-obese", "Obese"))
The traits in this data set are all modestly correlated, which is ideal for cape. In addition, we have selected traits from a range of physiological levels. Body weight is a high-level physiological trait, whereas glucose and insulin levels are endophenotypes measured at the molecular level.
Cape measures interactions between pairs of genes across all traits with the assumption that different genetic interactions identified for a single gene pair in the context of different phenotypes represent multiple manifestations of a single underlying gene network. By measuring the interactions between genetic variants in different contexts we can gain a clearer picture of the network underlying statistical epistasis [@Carter:2012fd].
Before starting a cape run, we need to specify all the parameters for the run in a parameter file. We use a .yml file for these parameters. A .yml file holds information easily readable to both humans and computers. We have multiple examples of cape parameter files in the demo folder associated with the cape package. It is probably easiest to start with one of these and modify it to fit your data. The following sections describe each cape parameter in the yml file.
The section of general parameters is where we specify which traits we will use, whether we want cape to normalize the traits for us, and whether we want cape to save the results. This is also the section in which we tell cape whether we want to use a kinship correction or not. kinship corrections are discussed further below.
If a parameter has multiple entries, for example we always want to test multiple traits, each entry gets its own line starting with a dash.
traits: This is where the traits to be scanned are entered. Each trait is entered on its own line preceded by a dash.
covariates: Any covariates are entered here. These are originally part of the trait matrix and are designated by cape as covariates. Each covariate specified is included as an additive covariate in each model, and also tested as an interactive covariate. Here we specify "pgm" as our covariate.
scan_what: This parameter refers to whether we will scan individual traits or composite traits called eigentraits. Eigentraits are calculated by factoring the trait matrix by singular value decomposition (SVD):
$$Y = U \cdot V \cdot W^{T}$$
Where $Y$ is a matrix containing one column for each mean-centered, normalized phenotype and one row for each individual. If $Y$ contains more individuals than phenotypes, the $U$ matrix has the same dimensions as Y with each column containing one eigentrait. $V$ contains the singular values, and $W^{T}$ contains the right singular vectors in rows. See @Carter:2012fd for more details.
The SVD de-correlates the traits concentrating phenotypic features into individual eigentraits. One benefit of this process is that variants that are weakly correlated to several traits due to common underlying processes may be strongly correlated to one of the eigentraits. This eigentrait captures the information of the underlying process, making strong main effects distributed between traits easier to detect and identify as potential interaction loci and/or covariates.
To specify using individual traits, set scan_what
to raw_traits.
To specify using eigentraits, set scan_what
to eigentraits.
traits scaled: This is a logical value (true/false) indicating whether cape should mean center and standardize the traits.
traits normalized: This is a logical value indicating whether cape should normalize the traits.
eig_which: After performing the SVD to create orthogonal eigentraits, you may wish to analyze only a subset of them in cape. For example, if the first two eigentraits capture more than 90\% of the trait variance, you may wish to discard the last eigentrait. This results in a loss of some information, but may increase power to detect important trait-related signals.
To specify which eigentraits to use, enter each on its own line. This is a bit more tedious than simply entering a number of eigentraits, but offers a little more flexibility in case you would like to analyze eigentraits that don't necessarily start at eigentrait 1.
pval_correction: This is where we specify the correction for multiple
testing. It can be any of the options in p.adjust:
"holm", "hochberg",
"hommel", "bonferroni", "BH", "BY", "fdr", or "none."
use_kinship: A logical value indicating whether to use a kinship correction.
pop: This parameter is only necessary to set if use_kinship is set
to true. If use_kinship is true, pop can be one of "2PP" for a two-parent
cross, "MPP" for a multi-parent cross, or "RIL" for a recombinant inbred
line. This parameter is passed to kinship
to inform the calculation of
the kinship matrix.
save_results: A logical value indicating whether results should be written to files.
use_saved_results A logical value indicating whether cape should use data already written to file. This can be helpful if a job aborts partway through. You can use everything that has been calculated already and save time on the next run. If use_saved_results is true, only results that don't already exist will be calculated. This can cause problems if there has been a parameter change between two runs, and there is a mismatch between previously calculated results and new results. We recommend setting this to false for most cases.
transform_to_phenospace A logical value indicating that is required scanning eigentraits. If true, all effects are rotate back to trait space after calculation, such that the final networks show marker influences on traits. If false, all effects will be kept in terms of eigentraits, and all final tables and plots will show marker effects on eigentraits.
Cape performs marker regression on individual markers before running the pairwise tests. If there are more markers than can be tested reasonably in the pairwise scan, cape uses the results from the single-locus scan to select markers for the pairwise scan.
ref_allele: The reference allele. In two-parent populations, this will usually be allele A (major allele). In multi-parent populations, this may be a different allele. For example, the C57B6/J mouse is often used as a reference strain in mouse studies. It is also one of the founders of the Diversity Outbred (DO) and Collaborative Cross (CC) mice [@chesler2008collaborative], and serves as a reasonable reference. To specify the B6 mouse as the reference, use its letter code "B" as the reference allele. All alleles in cape are stored as individual letters, A through the number of alleles that are present. For example, there are eight founder alleles in the DO/CC mice that are represented by the letters A through H.
singlescan_perm: This parameter specifies the number of permutations done in the single-locus scan. It is perfectly okay to set this to 0. The permutations are not used for anything in cape. They simply allow you to make a first-pass glance at identifying markers with main effects in your data. However, if you are interested in doing the main effect scan properly, we recommend using R/qtl2 [@broman2019r].
alpha: If "singlescan_perm" is a number greater than 0, you can also
specify a significance threshold alpha. If alpha is specified, cape will
calculate an effect size threshold for each value of alpha. These can be
plotted using plot_singlescan
.
As mentioned above, the single-locus scan is primarily used to select markers for the pairwise scan. We usually do this by selecting the markers with the largest main effects. Alternatively, you can also specify markers for the pairwise scan using a text file.
marker_selection_method: This parameter should be either "top_effects", or "from_list." If "top_effects," cape will select markers for the pair scan based on their main effects across all traits in the single-locus scan. In this case, a few additional parameters need to be specified:
peak_density: To select markers for the pairscan, cape first identifies peaks in effect size across all traits, and samples markers within the peaks for further testing. "peak_density" specifies the fraction of markers that will be sampled uniformly from under a large peak. For example, if "peak_density" is set to 0.5, 50\% of markers under a peak will be selected for downstream testing. For populations with high linkage disequilibrium (LD), you might want to set this lower to get a better sampling of peaks from across the genome with less redundancy. For a population with low LD, you may want to set this parameter higher to get a better sampling of markers with large main effects.
num_alleles_in_pairscan: This parameter sets how many markers will be tested in the pairwise scan. Cape is relatively computationally intensive, and we cannot do exhaustive pairwise testing in densely genotyped populations. We suggest limiting this number to around 1500 markers. These can be tested in roughly 24 hours using 2 or 3 traits/eigentraits and 1.5 million permuted tests. Increasing the number of traits/eigentraits analyzed substantially increases the time of a run, and it might be necessary to lower the number of markers tested further if there are many traits/eigentraits in the analysis.
tolerance: In selecting markers by their main effect size, cape starts at a high effect size, and gradually lowers it until it has collected the desired number of markers. The "tolerance" gives cape some wiggle room in how many markers are selected. If "num_alleles_in_pairscan" is 100 and the "tolerance" is 5, cape will allow any number of markers between 95 and 105.
snp_file: If "marker_selection_method" is "from_list" a file name needs to be specified. The file must be in the results directory. It is a tab-delimited text file with one column. The column should contain the names of the markers to be tested in the pairscan. This can get a little tricky for multi-parent crosses. Because cape tests individual alleles and not markers in the pair scan, the markers must also have the allele name appended with an underscore ("_"). This is relatively simple for a two-parent cross: If the reference allele is A, all markers will just have a "_B" appended to their name. If, however, this is a multi-parent population, specifying the correct allele is important. Any of the alleles other than the reference can be specified for each marker. The underscore introduces another little complication, which is that no underscores are allowed in marker names except to separate the marker name from the allele. If there are underscores in marker names, cape deletes these. This means that any file specifying marker names for the pairscan should also remove all underscores that are not separating the marker name from the allele. This is a bit cumbersome, but can be completely avoided by setting "marker_selection_method" to "top_effects."
The pairwise scan has a few additional parameters to set. Because testing genetic markers in LD can lead to false positives, we refrain from testing marker pairs that are highly correlated. This (Pearson) correlation is set by the user as "max_pair_cor". We generally use a value of 0.5 in our own work, but this can be raised or lowered depending on your own assessment of the influence of LD in your population.
max_pair_cor: A numeric value between 0 and 1 indicating the maximum Pearson correlation that two markers are allowed in pairwise testing. If the correlation between a pair of markers exceeds this threshold, the pair is not tested. If this value is set to NULL, "min_per_genotype" must have a numeric value.
min_per_geno: This is an alternative way to limit the effect of LD on pairwise testing. This number sets the minimum number of individuals allowable per pairwise genotype combination. If for a given marker pair, one of the genotype combinations is underrepresented, the marker pair is not tested. Setting this parameter is not recommended if you have continuous genotype probabilities, as the number of genotype combinations will be too high. If this value is NULL, max_pair_cor must have a numeric value.
pairscan_null_size: This parameter can be a little confusing, as it is different from the number of permutations to run in the the pairwise scan. @Carter:2012fd showed that multiple tests from a single permutation can be combined to create the null distribution for cape statistics. This saves enormously on time. So instead of specifying the number of permutations, we specify the total size of the null distribution. If you are testing 100 markers, you will test at most 100 choose 2, or 4950 marker pairs. Choose a "pairscan_null_size" that is at least as big as the number of marker pairs you are testing. For smaller numbers of markers, you may want to choose a null distribution size that is many times larger than the number of pairs you are testing.
Unlike previous versions of cape, which required the user to perform many
individual steps, this version of cape can be run with a single command
run_cape
. The line below runs the complete cape pipeline for the
demo NON/NZO mouse backcross. If you run this line interactively, set verbose
to TRUE to see the progress printed to your screen. When verbose is set
to FALSE, important messages are still printed, but printing of progress
is suppressed.
Note on results: To have the results files saved so that you can
visualize them without further commands, open up the parameter file
0_NON_NZO.parameters_0.yml, and set save_results
to true
. This
will tell cape to save results files to the results directory. Otherwise,
only the final_cross
object is returned, and plots must be generated
using further commands.
To use the saved results, set use_saved_results
to true
. This
will enable you to read in and use previously saved results in case
the cape run halts before finishing.
param_file <- here("demo", "demo_qtl", "0_NON_NZO.parameters_0.yml") results_path <- here("demo", "demo_qtl") final_cross <- run_cape(obesity_cross, obesity_geno, results_file = "NON_NZO", p_or_q = 0.05, verbose = FALSE, param_file = param_file, results_path = results_path)
The benefit to this new function is obvious: cape is now much easier to run. However, it comes with a sizeable reduction in flexibility. For most users the benefit of the increased ease of use will far outweigh the decrease in flexibility.
If you do need of more flexibility, all functions in the function
run_cape
are exported, so the entire analysis can be run step-by-step
as it was in earlier versions. All plotting functions are also exported
to allow greater flexibility in plotting. Although run_cape
does some
plotting automatically, as long as save_results
is set to true in the
parameter file, you may wish to re-run some plotting functions with your
own settings.
However you choose to run cape, the cape team is very happy to answer any questions and to help with running or interpreting any cape analysis.
As described above, the first step performed by cape is typically to decompose the trait matrix into eigentraits. This is done if "scan_what" is set to "eigentraits."
This step uses singular value decomposition (SVD) to decompose the trait matrix into orthogonal eigentraits. Because we use modestly correlated traits in cape by design, the eigentrait decomposition may help concentrate correlated signals, that are otherwise distributed across traits, into individual eigentraits. This potentially increases power to detect variants associated with the common components of groups of traits.
Cape uses the function plot_svd
to plot the results of the SVD. The
plot for the NON/NZO data set are shown below.
plot_svd(final_cross)
In the example illustrated here, the first eigentrait captures 75\% of the total trait variance. This eigentrait describes the processes by which body weight, glucose levels, and insulin levels all vary together. The correlations between obesity and risk factors for obesity, such as elevated insulin and fasting glucose levels are well known [@Permutt:2005ir; @Das:2006bo; @Haffner:2003kb]. The second eigentrait captures 14\% of the variance in the phenotypes. It captures the processes through which glucose and body weight vary in opposite directions. This eigentrait may be important in distinguishing the genetic discordance between obesity and diabetes. While obesity is a strong risk factor for diabetes, not all those who are obese have diabetes, and not all those with diabetes are obese [@Permutt:2005ir; @Burcelin:2002co].
The third eigentrait is less interpretable biologically, as it describes the divergence of blood glucose and insulin levels. It may represent a genetic link between glucose and body weight that is non-insulin dependent. Because we are primarily interested in the connection between diabetes and insulin, we used only the first two eigentraits for the analysis. In many cases in which more than two phenotypes are being analyzed, the first two or three eigentraits will capture the majority of the variance in the data and capture obvious features. Other eigentraits may capture noise or systematic bias in the data. Often the amount of total variance captured by such eigentraits is small, and they can be removed from the analysis.
Ultimately, there is no universal recipe for selecting which eigentraits should be included in the analysis, and the decision will be based on how the eigentraits contribute to the original phenotypes and how much variance in the data they capture.
After computing eigentraits, we go on to perform marker regression on all individual markers:
$$U_{i}^{j} = \beta_{0}^{j} + x_{i}\beta^{j} + \epsilon_{i}^{j}$$
The index $i$ runs from 1 to the number of individuals, and $j$ runs from 1 to the number of eigentraits or traits. $x_{i}$ is the probability of the presence of the reference allele for individual $i$ at locus $j$. The primary purpose of this scan is is to select markers for the pairwise scan if there are too many to test exhaustively.
Although run_cape
creates files with the singlescan results automatically,
it is also sometimes desireable to re-plot these results with different
parameters. We show how to do that below by reading in the singlescan
results file and using plot_singlescan
with parameters different from
those in run_cape.
singlescan_obj <- readRDS(here("demo", "demo_qtl", "results", "NON_NZO_singlescan.RDS")) plot_singlescan(final_cross, singlescan_obj, line_type = "h", lwd = 2, covar_label_size = 1)
et1_fig <- here::here("vignettes", "Singlescan_ET1_Standardized.jpg") cat(paste0("![](", et1_fig, "){width=70%}\n"))
et2_fig <- here::here("vignettes", "Singlescan_ET2_Standardized.jpg") cat(paste0("![](", et2_fig, "){width=70%}\n"))
The purpose of the pairwise scan is to find interactions, or epistasis, between variants. The epistatic models are then combined across traits or eigentraits to infer a parsimonious model that takes data from all traits or eigentraits into account.
To find epistatic interactions we test the following regression model for each variant 1 and 2:
$$ U_{i}^{j} = \beta_{0}^{j} + \underbrace{\sum_{c=1}^{2}x_{c,i}\beta_{c}^{j}}{\mathrm{covariates}} + \underbrace{x{1,i}\beta_{1}^{j} + x_{2,i}\beta_{2}^{j}}{\mathrm{main\;effects}} + \underbrace{x{1,i}x_{2,i} \beta_{12}^{j}}{\mathrm{interaction}} + \epsilon{i}^{j} $$
The terms in this equation are the same as those in the equation for the single-marker scan except for the addition of the term for the interaction between the two variants being tested.
To reduce the incidence of false positives arising from genetic relatedness, cape can implement a kinship correction [@Kang:2008bx].
The relatedness matrix ($K$) is calculated using the markers on the remaining chromosomes as follows:
$$ K = \frac{G \times G^T}{n},$$
where $G$ and $n$ is the number of markers in $G$. This matrix is then used to correct the genotype and phenotype matrices for kinship using hierarchical linear models [@Kang:2008bx; @gelman2006data].
Currently cape only implements an overall kinship correction. Anecdotally, we have tested leave-two-chromosomes-out methods as an extension of the leave-one-chromosome-out method already commonly used [@cheng2013practical; @gonzales2018genome], but our results were inconsistent suggesting we need to do more research into this area. More appropriate corrections are currently a topic of research in our group and will be implemented in later iterations of cape.
The reparameterization step is where we actually do the combined analysis of pleiotropy and epistasis that cape is named for. This step reparameterizes the $\beta$ coefficients from the pairwise linear regressions across all traits/eigentraits in terms of directed influence coefficients. These directed influence coefficients are consistent across all traits, and are therefore trait-independent. Cape identifies one set of genetic interactions that explains all traits simultaneously [@Carter:2012fd].
From the pair scan, each pair of markers 1 and 2 receives a set of $\beta$
coefficients describing the main effect of each marker on each eigentrait
$j$ ($\beta^j_1$ and $\beta^j_2$) as well as the interaction effect of both
markers on each eigentrait ($\beta^j_{12}$) (See figure below). The central
idea of cape
is that these coefficients can be combined across eigentraits
and reparameterized to calculate how each pair of markers influences each
other directly and independently of eigentrait.
reparam_file <- here::here("vignettes", "reparam.png") cat(paste0("![](", reparam_file, "){width=50%}\n"))
The first step in this reparameterization is to define two new parameters ($\delta_1$ and $\delta_2$) in terms of the interaction coefficients. $\delta_1$ can be thought of as the additional genetic activity of marker 1 when marker 2 is present. Together the $\delta$ terms capture the interaction term, and are interpreted as the extent to which each marker influences the effect of the other on downstream phenotypes. For example, a negative $\delta_2$ indicates that the presence of marker 2 represses the effect of marker 1 on the phenotypes or eigentraits. The $\delta$ terms are related to the main effects and interaction effects as follows:
$$ \label{eqn:delta_mat_def} \begin{bmatrix} \beta^1_1 & \beta^1_2\ \beta^2_1 & \beta^2_2\ \vdots & \vdots \end{bmatrix} \cdot \begin{bmatrix} \delta_1\ \delta_2\ \end{bmatrix} = \begin{bmatrix} \beta^1_{12}\ \beta^2_{12}\ \vdots \end{bmatrix} $$
In multiplying out this equation, it can be seen how the $\delta$ terms influence each main effect term to give rise to the interaction terms independent of phenotype.
$$ \label{eqn:delta_def} \beta^j_1\delta_1 + \beta^j_2\delta_2 = \beta^j_{12} $$
If $\delta_1 = 0$ and $\delta_2 = 0$, there are no addition effects exerted by the markers when both are present. Substitution into the equations above shows that the interaction terms $\beta^j_{12}$ are $0$ and thus the interaction terms have no effect on the phenotype.
Alternatively, consider the situation when $\delta_1 = 1$ and $\delta_2 = 0$. The positive $\delta_1$ indicates that marker 1 should exert an additional effect when marker 2 is present. This can be seen again through substitution into equation \ref{eqn:delta_def}:
$$ \beta_j^1 = \beta_{12}^j $$
These non-zero terms show that there is an interaction effect between marker 1 and marker 2. The positive $\delta_1$ indicates that this interaction is driven through an enhanced effect of marker 1 in the presence of marker 2.
The $\delta$s are calculated by solving for equation \ref{eqn:delta_mat_def} using matrix inversion:
$$ \begin{bmatrix} \delta_1\ \delta_2\ \end{bmatrix} = \begin{bmatrix} \beta^1_1 & \beta^1_2\ \beta^2_1 & \beta^2_2\ \vdots & \vdots \end{bmatrix}^{-1} \cdot \begin{bmatrix} \beta^1_{12}\ \beta^2_{12}\ \vdots \end{bmatrix} $$
This inversion is exact for two eigentraits, and cape
implements
pseudo-inversion for up to 12 eigentraits. We have observed that maximum
power to detect interactions is achieved with three to five eigentraits
[@tyler2017epistatic].
The $\delta$ terms are then translated into directed variables defining the marker-to-marker influences $m_{12}$ and $m_{21}$. Whereas $\delta_2$ described the change in activity of marker 2 in the presence of marker 1, $m_{12}$ can be thought of as the direct influence of marker 2 on marker 1, with negative values indicating repression and positive values indicating enhancement. The terms $m_{12}$ and $m_{21}$ are self-consistent and defined in terms of $\delta_1$ and $\delta_2$:
$$ \delta_1 = m_{12}(1 + \delta_2),\;\delta_2 = m_{21}(1 + \delta_1) $$
Rearranging these equations yields the solutions:
$$ m_{12} = \frac{\delta_1}{1 + \delta_2},\;m_{21} = \frac{\delta_2}{1 + \delta_1}. $$
These directed influence variables provide a map of how each marker influences each other marker independent of phenotype. The significance of these influences is determined through standard error analysis on the regression parameters [@bevington1994data; @Carter:2012fd]. This step is particularly important as matrix inversion can lead to large values but larger standard errors, yielding insignificant results. As an example, the variance of $m_{12}$ is calculated by differentiating with respect to all model parameters:
$$ \sigma^{2}{m{12}} \cong \sum_{ij} \sigma^2_{\beta_{i}^{j}} \Bigg(\frac{\partial m_{12}} {\partial \beta_{i}^{j}}\Bigg)^2 + 2 \sum_{i<k, j < l} \sigma^2_{\beta^{j}{i}\beta^{l}{k}} \Bigg(\frac{\partial m_{12}} {\partial \beta^{j}{i}} \Bigg) \Bigg(\frac{\partial m{12}}{\partial \beta^{l}_{k}} \Bigg) $$
In this equation, the indices $i$ and $k$ run over regression parameters and $j$ and $l$ run from 1 to the number of traits.
In addition to identifying directed influences between genetic markers, cape also calculates main effects. These effects are a little different from the main effects we typically calculate. The main effect of a locus calculated in the usual way is the main effect averaged across all genetic backgrounds. In cape, the main effect of a marker is derived from the set of all pairwise regressions that included that marker. We select the maximum main effect exerted by that marker across all pairwise tests. That is, the main effect that is reported by cape is a main effect conditioned on another marker or covariate. The conditioning markers and covariates are reported along with the main effects in the VariantInfluences.csv file.
If eigentraits were scanned, the main effects can be recast in terms of the original traits. This step is performed by multiplying the coefficient matrices are by the singular value matrices $V \cdot W^{T}$. With two phenotypes and two eigentraits this conversion results in no loss of information. The translation back to phenotype space does not affect the interactions between markers.
The primary output of cape is the file VariantInfluences.csv. This
file holds information about the interactions and main effects
identified in the data. Another file Variant.Influences.Interactions.csv,
excludes the main effects. Please see the documentation for the
function write_variant_influences
for a description of the columns
in these files.
Cape results can also be plotted in multiple different forms. The function
plot_variant_influences
plots the results as a heatmap. Interactions
between genetic loci are shown in the main part of the graph and
main effects are shown along the right-hand side.
The direction of the influences is read from the $y$-axis to the $x$-axis. For example, there is a large blue block near the top of the figure indicating that multiple markers on chromosome 1 suppress the phenotypic effects of markers on chromosome 12. To see the main effects of markers on chromosome 1, follow the chromosome 1 line all the way to the right to see that markers on chromosome 1 increase levels of all three traits.
plot_variant_influences(final_cross, show_alleles = FALSE)
Positive effects are shown in brown, and negative effects are in blue. Marker pairs that were not tested due to high correlation are shown in gray. Chromosome boundaries and labels are shown along the $x$ and $y$ axes.
The function plot_network
plots cape results as a circular network.
Here chromosomes are arranged in a circle with traits in concentric
circles around the chromosomes. Main effects are depicted in these
concentric circles and genetic interactions are shown as arrows within
the circle.
plot_network(final_cross)
And finally, the function plot_full_network
This function plots
the same results in a network that ignores genomic position.
This view accentuates the structure of the genetic interaction
network.
Each node is one genetic locus or covariate. Each is plotted as a pie chart with one trait per section. Significant effects are indicated by coloring the sections corresponding to the traits either brown for positive main effects or blue for negative main effects. Gray indicates no significant main effect. Interactions are shown as arrows between the nodes.
plot_full_network(final_cross, zoom = 1.2, node_radius = 0.3, label_nodes = TRUE, label_offset = 0.4, label_cex = 0.5, bg_col = "lightgray", arrow_length = 0.1, layout_matrix = "layout_with_kk", legend_position = "topright", edge_lwd = 1, legend_radius = 2, legend_cex = 0.7, xshift = -1)
Both the network figure and the adjacency matrix show direct influences of markers on the phenotypes as well as interactions between markers. As expected, many NZO variants on multiple chromosomes show positive effects on plasma insulin and glucose levels as well as on body weight.
We can plot individual interactions from the Variant Influences table
to get a more detailed look at some of these interactions. There is a
new plotting function in cape called plot_effects
that can be used
to plot the phenotypic effects of individual interactions in multiple
different styles.
For example, there is a positive interaction between Chrs 2 and 15 in which the presence of the NZO allele on Chr 2 enhances the positive influence of the NZO allele on all traits.
We can look at both the main effects and the interaction effects using
plot_effects
. First to verify the positive effect of the marker on Chr 15,
we can plot it by itself. Here we use plot_type = l
which generates a line
plot.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno, marker1 = "D15Mit72_B", marker1_label = "Chr15", plot_type = "l", error_bars = "se")
We can then look at both markers together to see the enhanced effect of this allele when the Chr 2 NZO allele is present.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno, marker1 = "D2Mit120_B", marker2 = "D15Mit72_B", marker1_label = "Chr2", marker2_label = "Chr15", plot_type = "l", error_bars = "se")
We can see from this plot that the effect of the Chr 15 NZO allele is negligible when the Chr 2 NZO allele is present, but is quite substantial when the Chr 2 NZO allele is present. Thus the NZO allele on Chr 2 enhances the positive effect of the Chr 15 allele across all traits.
We can look at this with other types of visualization. The bar plot below shows the same data in a different form. We can see here that the Chr 2 allele has a negative effect on the trait, while the Chr 15 allele has a small positive effect. However, when both NZO alleles are present, all trait values are higher than expected from the additive effects alone.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno, marker1 = "D2Mit120_B", marker2 = "D15Mit72_B", marker1_label = "Chr2", marker2_label = "Chr15", plot_type = "b", error_bars = "se")
There are several other plotting types that are worth exploring when looking at different types of interactions, including heat maps, and points for individual values.
These findings illustrate how cape
is designed to find interactions that
simultaneously model all phenotypes under the assumption that interactions
between variants across multiple contexts represent a single underlying
interaction network. Thus we recommend users assess single-phenotype epistasis
using functions in cape
or in parallel analyses using tools such as R/qtl2
and R/qtlbim [@rqtlbim].
The following list describes the plotting functions available in cape. They are ordered by where you would probably use them in a cape analysis.
hist_pheno: Plots trait distributions as histograms. Helps identify whether traits are normally distributed.
qnorm_pheno: Plots trait quantiles compared to theoretical normal quantiles. Helps identify whether traits are normally distributed.
plot_pheno_cor: Plots the correlation between pairs of traits. Includes individual trait distributions, correlation plots, and correlation values. Points can be color-coded by a covariate or another trait.
plot_svd: Plots the results of the singular value decomposition that generates the eigentraits. Shows the total trait variance explained by each eigentrait, as well as the contribution of each trait to each eigentrait.
plot_singlescan: Plots the results of the single-locus scan. Can plot either standardized effects or non-standardized effects. Also plots effects both by allele and by marker.
plot_pairscan: Plots the results of the pairwise scan.
plot_variant_influences: Plots cape results as a heatmap. Genetic interactions are plotted in the main part of the heatmap, and main effects are plotted along the right-hand side. Positive and negative effects are distinguished by color. This plot is useful for identifying overall trends in interaction networks, particularly dense interaction networks.
plot_network: Plots cape results as a circular network. Genetic markers are laid out in a circle with traits in concentric circles around the chromosomes. Main effects are indicated in the trait circles. Interactions are drawn as arrows between genetic markers or between genetic markers and covariates. This view is good for identifying patterns on a finer scale than the heat map view, particularly for sparse networks. This view is not very informative for dense networks.
plot_full_network: Plots cape results in a standard network layout. Each marker is a node, and its main effects are shown in a pie chart in which each section of the pie is a trait. Interactions are shown as arrows between nodes. This view is good for looking at overall network structure independent of genomic position. This view can be informative both for dense and sparse networks. The function contains many arguments for adjusting the layout of the network for better visualization.
plot_effects: Plots the effects of individual interactions on traits. This function is for exploring individual interactions in more detail. It can plot main effects and interaction effects as lines, points, bars, or heatmaps. To select individual interactions to plot, use the file plot.variant.influences.csv or Variant.Influences.Interactions.csv. The function plot_variant_influences can also return these tables invisibly without writing to a file.
The following list describes each output file produced by run_cape
.
Where we use cross
below, this stands in for the user-defined
experiment name. The default experiment name in cape is "cross."
cross_geno.RData: The genotype object used in the cape run. This will have any markers on the X, Y, or mitochondrial chromosomes removed, and any underscores removed from marker names. This genotype object can be used in post-cape plotting functions.
cross.pairscan.RData: The results of the pairwise scan. It consists of a list with the following five elements: * ref_allele: The allele used as the reference for the tests. * max_pair_cor: The maximum pairwise correlation between marker pairs * pairscan_results: A list with one element per trait. The element for each trait is a list of the following three elements: * pairscan_effects: the effect sizes from the linear models * pairscan_se: the standard errors from the linear models * model_covariance: the model covariance from the linear models. * pairscan_perm: The same structure as pairscan_results, but for the permuted data. * pairs_tested_perm: A matrix of the marker pairs used in the permutation tests.
cross.parameters.yml The parameter file dictating all the parameters for the cape run.
cross.RData The data object. This object holds all the trait data, and cape results. It is used as input for almost all cape functions.
cross.singlescan.RData: The results of the single-locus scan. The results are a list of the following seven elements: * alpha: The alpha values set in the argument alpha * alpha_thresh: The calculated effect size thresholds at each alpha if permutations are run. * ref_allele: The allele used as the reference allele * singlescan_effects: The effect sizes (beta coefficients) from the single-locus linear models * singlescan_t_stats: The t statistics from the single-locus linear models * locus.p_vals: Marker-level p values * locus_score_scores: Marker-level test statistics.
Network.Circular.jpg A jpg file containing the circular cape results
plot generated by plot_network
Network.Circular.pdf: A pdf file containing the circular cape results
plot generated by plot_network
Network.View.jpg: A jpg file containing the cape results plot
generated by plot_full_network
.
Network.View.pdf: A pdf file containing the cape results plot
generated by plot_full_network
.
Pairscan.Regression.jpg: A jpg file containing the results
of the pairwise scan plotted by plot_pairscan
.
Pairscan.Regression.pdf: A pdf file containing the results
of the pairwise scan plotted by plot_pairscan
.
permutation.data.RData The results of the permutations from the
function pairscan
.
Singlescan.Effects.jpg The results of the single-locus scan. All
allele effects are plotted in this plot, as well as the effects for each
locus overall. Multiple allele effects are only plotted for multi-parent
populations. Otherwise, the allele plot looks similar to the locus plot.
There will be one singlescan effects plot for each trait, and the trait
name will be included in the filename. This plot is generated by plot_singlescan
.
Singlescan.ET1.Standardized.jpg: The results of the single-locus scan.
In contrast to the file above, this plot shows only the standardized allele
effects for each locus. Similar to the file above, there will be one effects
plot for each trait, and the trait name will be included in the filename.
This plot is generated by plot_singlescan
.
svd.jpg: A jpg file showing the results of the singular value
decomposition of the traits, if that step was performed. The plot
is generated by plot_svd
. It shows the total trait variance
explained by each eigentrait, as well as the contribution of
each trait to each eigentrait.
svd.pdf A pdf version of svd.jpg.
Variant.Influences.csv: A csv file holding the results of cape, and written by the function `write_variant_influences' This file contains both main effects and interactions.
Variant.Influences.Interactions.csv: A csv file holding the results of cape, and written by the function `write_variant_influences' This file contains only the interactions.
variant.influences.jpg: A jpg showing cape results as plotted by
plot_variant_influences
.
variant.influences.pdf: A pdf showing cape results as plotted by
plot_variant_influences
.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.