Functional Genomics by Expectation Maximization (FGEM) takes a list of gene-level bayes factors and gene-level annotations those genes, and returns the posterior probability for the genes and effect sizes for the annotations. For more info on the FGEM statistical model, read on. For more info on FGEM the R package, skip to the "usage" section.
For each of the $I$ genes being tested $B_i$ is the gene-level Bayes Factor from an association study for the trait of interest. For each of the $I$ genes being tested we also have $J$ functional annotations. These annotations can be a mix of almost any type of data. They could be binary data (e.g presence of absence of a particular GO term), count data (e.g number of exons), or continuous data (e.g level of conservation across mammals). These annotations make up the matrix $A$, which is $IxJ$.
For each gene, $Z_i=1$ indicates that gene $i$ is involved in the trait of interest, and $Z_i=0$ indicates that is not. If we knew $Z_i$ we could use logistic regression to learn the importance of each annotation in the trait of interest: $$logit(P(Z_i=1))=A_i\beta$$
However, we do not know $Z_i$. We can instead use Empirical Bayes. If we rewrite our bayes factor as the probability of observing the genotype data given the gene is causal divided by the probability of observing the genotype data given that the gene is not causal, then we can write the following likelihood function.
$$ P(x|\beta)=\prod_iP(x_i|\beta)=\prod_i [\pi_i(\beta)P(x_i|Z_i=1)+(1-\pi_i(\beta))P(x_i|Z_i=0)]$$
Remembering the definition of the Bayes factor from, this is equivalent to
$$P(x|\beta) \propto \prod_i[\pi_i(\beta)B_i+(1-\pi_i(\beta))]$$ And we can use maximum likelihood here to make estimates for $\beta$
Using Bayes rule, the posterior is $$P(Z_i=1|x)=\frac{P(x|Z_i=1)P(Z_i=1)}{P(x)}=\frac{\pi_i(\beta)B_i}{\pi_i(\beta)B_i+(1-\pi_i(\beta))}$$
The three main user-facing functions for FGEM
are the functions FGEM_df
, gen_prior
and gen_posterior
FGEM_df
FGEM_df
takes these parameters:
feat_df
: A dataframe with data. feat_df
must have a column called BF
specifying bayes factors. Any additional columns are treated as annotations.prior_mean
: A scalar between 0 and 1 specifying the starting prior probability (this is for initializing the EM algorithm). If unspecified, it defaults to 0.2.null_features
: This is a character vector specifying which features should be included in the null model. The default is to only include an intercept term. Note that if there is no column named Intercept
, and null_features
is not specified (or the default is used), an intercept feature is added.Beta0
: A vector specifying starting guesses for effect sizes of annotations. In almost all cases the default value (NULL
) should be keptFGEM_df
returns a dataframe that has, for each annotation, it's name (taken from the columns of feat_df
)
gen_prior
and gen_posterior
gen_prior
both take 2 arguments
feat_df
: same as FGEM_df
(gen_prior
does not require that feat_df
have a BF
column)fgem_result_df
the output from FGEM_df
The package comes with an example list of Bayes Factors (BFdata
) as well as example continuous annotations in the form of ExAC conservation scores (exacdf
), and binary annotations in the form of Biological Process Gene Ontoloy membership (BPGOdf
).
library(fgem) library(dplyr) library(tidyr) data("BFdata") data("BPGOdf") BFdata <- select(BFdata,Gene,BF) head(BFdata)
data("exacdf") head(exacdf)
data("BPGOdf") head(BPGOdf)
Let's look at the effect of lof_z
from ExAC
lof_df <- filter(exacdf,feature=="lof_z") %>% spread(feature,value) %>% select(-class) %>% inner_join(BFdata) result_lof <- FGEM_df(lof_df)
FGEM_df
returns a one row dataframe with a summary of the fit
result_lof
To get the values of Beta
for each feature:
select(result_lof,data) %>% unnest()
To get the prior and posterior from the model
result_lof <- gen_prior(lof_df,result_lof) %>% inner_join(gen_posterior(lof_df,result_lof)) arrange(result_lof,posterior) %>% head()
Let's look at lof_z
,mis_z
,pLI
, and 5 random GO terms
exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI")) exac_sub_df <- semi_join(exacdf,exac_features) %>% spread(feature,value) %>% select(-class) %>% inner_join(BFdata) GO_features <- distinct(BPGOdf,feature) %>% sample_n(5) feature_df <- semi_join(BPGOdf, GO_features) %>% spread(feature, value, fill = 0) %>% select(-class) %>% right_join(exac_sub_df) %>% mutate_all(function(x) ifelse(is.na(x), 0, x)) fit_df <- FGEM_df(feature_df, use_squarem = FALSE) ofit_df <- FGEM_df(feature_df, use_squarem = TRUE) result_df <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df)) arrange(result_df,posterior) %>% head()
Let's add gene size (bp
) as a feature, but we'll also add it to the null hypothesis
exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI","bp")) exac_sub_df <- semi_join(exacdf,exac_features) %>% spread(feature,value) %>% select(-class) %>% inner_join(BFdata) GO_features <- distinct(BPGOdf,feature) %>% sample_n(5) feature_df <- semi_join(BPGOdf,GO_features) %>% spread(feature,value,fill=0) %>% select(-class) %>% right_join(exac_sub_df) %>% mutate_all(function(x)ifelse(is.na(x),0,x)) fit_df <- FGEM_df(feature_df,null_features = c("bp","Intercept")) result_df <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df)) arrange(result_df,posterior) %>% head()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.