knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE )
The sparse marginal epistasis (SME) test performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data.
By examining one SNP at a time, SME fits a linear mixed model to test for marginal epistasis. It explicitly models the combined additive effects from all SNPs a marginal epistatic effect that represents pairwise interactions involving a test SNP.
The key to the SME formulation is that the interaction between the test SNP and other SNPs may be masked depending on additional information. Masking interactions that do not contribute to trait variance both maximizes the power of the inference as well as realizes the computational efficiency needed to analyze human Biobank scale data.
This pages presents a toy example to show case what insights the implementation of the test provides.
Figure 1. SME performs a genome-wide search for SNPs involved in genetic
interactions while conditioning on information derived from functional genomic
data. SME has improved power to detect marginal epistasis and runs 10x to 90x
faster than state-of-the-art methods.
library(smer) library(dplyr) library(ggplot2)
SME is implemented in R and requires your genetic data to be in PLINK format, which consists of three files:
.bim
: Contains SNP information.bed
: Contains genetic data.fam
: Contains sample informationNote: sme()
cannot handle missing genotypes. Prior to running sme()
,
use PLINK to remove variants with missing genotypes or impute them.
Additionally, your phenotype (trait) data should be in a separate file following PLINK's phenotype format.
The sme()
function includes parameters that let you control memory usage and
computational resources. For detailed guidance on optimizing these settings for
your system, please see our tutorial
How To Optimize the Memory Requirements of SME.
When selecting which SNPs to analyze, you must provide their positions as
1-based indices. These indices correspond to the row numbers in your
.bim
file, where the first SNP is index 1, the second is index 2, and so on.
For complete details about all function parameters, please refer to the
documentation of the sme()
function.
# File inputs plink_file <- "path/to/plink/file" pheno_file <- "path/to/pheno/file" mask_file <- "path/to/mask/file" # Parameter inputs chun_ksize <- 10 n_randvecs <- 10 n_blocks <- 10 n_threads <- 5 # 1-based Indices of SNPs to be analyzed n_snps <- 100 snp_indices <- 1:n_snps sme_result <- sme( plink_file, pheno_file, mask_file, snp_indices, chunk_size, n_randvecs, n_blocks, n_threads )
sme_result <- getting_started
This analysis uses simulated data for demonstration purposes. We simulated synthetic phenotypes from 5000 synthetic genotypes with 6000 SNPs. If you would like to learn how to simulate data, please refer to our tutorial How To Simulate Traits.
When you call the sme()
function, it returns a list containing multiple
elements. The most important one is called summary
, which contains the main
analysis results. These results are formatted as tidy data, making them
compatible with popular R packages like ggplot2
and dplyr
for further
analysis and visualization.
We use Manhattan plots to visualize genome-wide analyses because they effectively highlight strong associations between genetic variants and traits. In this case, the Manhattan plot specifically shows statistical epistasis (interactions between genes). For reference, we've marked the true causal SNPs (Single Nucleotide Polymorphisms) in green on the plot - these are the genetic variants we included in our simulation as having real effects.
sme_result$summary %>% ggplot(aes( x = index, y = -log10(p), color = true_gxg_snp )) + geom_point() + xlab("Position") + labs(color = "Epistatic SNP")
SME estimates how much of the total trait variation can be explained by genetic interactions. In this simulation, we set the Phenotypic Variance Explained (PVE) to 5% for each SNP involved in epistatic interactions.
The plot below shows how well our method recovered these effects. It displays two distributions:
The dashed line marks the true 5% PVE level we used in the simulation, allowing you to see how accurately the method estimated the actual effect sizes.
sme_result$summary %>% ggplot(aes(x = true_gxg_snp, y = pve, fill = true_gxg_snp)) + geom_boxplot() + geom_hline(yintercept = 0.05, color = "grey40", linetype = "dashed") + annotate("text", x = 0.8, y = 0.055, label = "True per SNP epistatic PVE", color = "black") + xlab("Epistatic SNP") + ylab("Phenotypic Variance Explained") + theme(legend.position = "none")
The SME method uses a linear mixed model to separate different sources of trait variation. One key component it estimates is narrow sense heritability ($h^2$), which measures how much trait variation can be explained by additive genetic effects.
The plot below breaks down the estimated sources of trait variation:
In our simulation, we set the true narrow sense heritability to 30%, shown as a dashed line in the plot. This reference line helps you evaluate how accurately SME estimated the genetic components of trait variation.
sme_result$vc_estimate %>% ggplot(aes(x = component, y = vc_estimate, fill = component)) + geom_boxplot() + geom_hline(yintercept = 0.3, color = "grey40", linetype = "dashed") + annotate("text", x = 0.7, y = 0.33, label = expression("True " * h^2), color = "black") + xlab("Component") + ylab("Variance Component Estimate") + theme(legend.position = "none")
The estimate of the narrow-sense heritability $h^2$ is much less variable because it is always informed by the same genetic relatedness matrix. In this small data example it overestimates the heritability but it is unbiased in general.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.