Introduction

Functional Genomics by Expectation Maximization (FGEM) takes a list of gene-level bayes factors and gene-level annotations those genes, and returns the posterior probability for the genes and effect sizes for the annotations. For more info on the FGEM statistical model, read on. For more info on FGEM the R package, skip to the "usage" section.

Model

Data

For each of the $I$ genes being tested $B_i$ is the gene-level Bayes Factor from an association study for the trait of interest. For each of the $I$ genes being tested we also have $J$ functional annotations. These annotations can be a mix of almost any type of data. They could be binary data (e.g presence of absence of a particular GO term), count data (e.g number of exons), or continuous data (e.g level of conservation across mammals). These annotations make up the matrix $A$, which is $IxJ$.

Model

For each gene, $Z_i=1$ indicates that gene $i$ is involved in the trait of interest, and $Z_i=0$ indicates that is not. If we knew $Z_i$ we could use logistic regression to learn the importance of each annotation in the trait of interest: $$logit(P(Z_i=1))=A_i\beta$$

However, we do not know $Z_i$. We can instead use Empirical Bayes. If we rewrite our bayes factor as the probability of observing the genotype data given the gene is causal divided by the probability of observing the genotype data given that the gene is not causal, then we can write the following likelihood function.

$$ P(x|\beta)=\prod_iP(x_i|\beta)=\prod_i [\pi_i(\beta)P(x_i|Z_i=1)+(1-\pi_i(\beta))P(x_i|Z_i=0)]$$

Remembering the definition of the Bayes factor from, this is equivalent to

$$P(x|\beta) \propto \prod_i[\pi_i(\beta)B_i+(1-\pi_i(\beta))]$$ And we can use maximum likelihood here to make estimates for $\beta$

Using Bayes rule, the posterior is $$P(Z_i=1|x)=\frac{P(x|Z_i=1)P(Z_i=1)}{P(x)}=\frac{\pi_i(\beta)B_i}{\pi_i(\beta)B_i+(1-\pi_i(\beta))}$$

Usage

The three main user-facing functions for FGEM are the functions FGEM_df, gen_prior and gen_posterior

FGEM_df

FGEM_df takes these parameters:

FGEM_df returns a dataframe that has, for each annotation, it's name (taken from the columns of feat_df)

gen_prior and gen_posterior

gen_prior both take 2 arguments

Examples

The package comes with an example list of Bayes Factors (BFdata) as well as example continuous annotations in the form of ExAC conservation scores (exacdf), and binary annotations in the form of Biological Process Gene Ontoloy membership (BPGOdf).

library(fgem)
library(dplyr)
library(tidyr)
data("BFdata")
data("BPGOdf")
BFdata <- select(BFdata,Gene,BF)
head(BFdata)
data("exacdf")
head(exacdf)
data("BPGOdf")
head(BPGOdf)

Single feature analysis

Let's look at the effect of lof_z from ExAC

lof_df <- filter(exacdf,feature=="lof_z") %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

result_lof <- FGEM_df(lof_df)

FGEM_df returns a one row dataframe with a summary of the fit

result_lof

To get the values of Beta for each feature:

select(result_lof,data) %>% unnest()

To get the prior and posterior from the model

result_lof <- gen_prior(lof_df,result_lof) %>% inner_join(gen_posterior(lof_df,result_lof))
arrange(result_lof,posterior) %>% head()

Multiple feature analysis

Let's look at lof_z,mis_z,pLI, and 5 random GO terms

exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI"))
exac_sub_df <- semi_join(exacdf,exac_features) %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

GO_features <- distinct(BPGOdf,feature) %>% sample_n(5)
feature_df <- semi_join(BPGOdf, GO_features) %>%
    spread(feature, value, fill = 0) %>%
    select(-class) %>%
    right_join(exac_sub_df) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x))

fit_df <- FGEM_df(feature_df, use_squarem = FALSE)
ofit_df <- FGEM_df(feature_df, use_squarem = TRUE)

result_df  <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df))
arrange(result_df,posterior) %>% head()

Competetive null hypothesis

Let's add gene size (bp) as a feature, but we'll also add it to the null hypothesis

exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI","bp"))

exac_sub_df <- semi_join(exacdf,exac_features) %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

GO_features <- distinct(BPGOdf,feature) %>% sample_n(5)
feature_df <- semi_join(BPGOdf,GO_features) %>% 
  spread(feature,value,fill=0) %>% 
  select(-class) %>% 
  right_join(exac_sub_df) %>% 
  mutate_all(function(x)ifelse(is.na(x),0,x))

fit_df <- FGEM_df(feature_df,null_features = c("bp","Intercept"))

result_df  <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df))
arrange(result_df,posterior) %>% head()


CreRecombinase/FGEM documentation built on July 17, 2020, 5:30 a.m.