In CreRecombinase/FGEM: R package for Functional Genomics using Expectation Maximization (FGEM)

Introduction

Functional Genomics by Expectation Maximization (FGEM) takes a list of gene-level bayes factors and gene-level annotations those genes, and returns the posterior probability for the genes and effect sizes for the annotations. For more info on the FGEM statistical model, read on. For more info on FGEM the R package, skip to the "usage" section.

Model

Data

For each of the $I$ genes being tested $B_i$ is the gene-level Bayes Factor from an association study for the trait of interest. For each of the $I$ genes being tested we also have $J$ functional annotations. These annotations can be a mix of almost any type of data. They could be binary data (e.g presence of absence of a particular GO term), count data (e.g number of exons), or continuous data (e.g level of conservation across mammals). These annotations make up the matrix $A$, which is $IxJ$.

Model

For each gene, $Z_i=1$ indicates that gene $i$ is involved in the trait of interest, and $Z_i=0$ indicates that is not. If we knew $Z_i$ we could use logistic regression to learn the importance of each annotation in the trait of interest: $$logit(P(Z_i=1))=A_i\beta$$

However, we do not know $Z_i$. We can instead use Empirical Bayes. If we rewrite our bayes factor as the probability of observing the genotype data given the gene is causal divided by the probability of observing the genotype data given that the gene is not causal, then we can write the following likelihood function.

$$ P(x|\beta)=\prod_iP(x_i|\beta)=\prod_i [\pi_i(\beta)P(x_i|Z_i=1)+(1-\pi_i(\beta))P(x_i|Z_i=0)]$$

Remembering the definition of the Bayes factor from, this is equivalent to

$$P(x|\beta) \propto \prod_i[\pi_i(\beta)B_i+(1-\pi_i(\beta))]$$ And we can use maximum likelihood here to make estimates for $\beta$

Using Bayes rule, the posterior is $$P(Z_i=1|x)=\frac{P(x|Z_i=1)P(Z_i=1)}{P(x)}=\frac{\pi_i(\beta)B_i}{\pi_i(\beta)B_i+(1-\pi_i(\beta))}$$

Usage

The three main user-facing functions for FGEM are the functions FGEM_df, gen_prior and gen_posterior

`FGEM_df`

FGEM_df takes these parameters:

feat_df: A dataframe with data. feat_df must have a column called BF specifying bayes factors. Any additional columns are treated as annotations.
prior_mean: A scalar between 0 and 1 specifying the starting prior probability (this is for initializing the EM algorithm). If unspecified, it defaults to 0.2.
null_features: This is a character vector specifying which features should be included in the null model. The default is to only include an intercept term. Note that if there is no column named Intercept, and null_features is not specified (or the default is used), an intercept feature is added.
Beta0: A vector specifying starting guesses for effect sizes of annotations. In almost all cases the default value (NULL) should be kept

FGEM_df returns a dataframe that has, for each annotation, it's name (taken from the columns of feat_df)

`gen_prior` and `gen_posterior`

gen_prior both take 2 arguments

feat_df: same as FGEM_df (gen_prior does not require that feat_df have a BF column)
fgem_result_df the output from FGEM_df

Examples

The package comes with an example list of Bayes Factors (BFdata) as well as example continuous annotations in the form of ExAC conservation scores (exacdf), and binary annotations in the form of Biological Process Gene Ontoloy membership (BPGOdf).

library(fgem)
library(dplyr)
library(tidyr)
data("BFdata")
data("BPGOdf")
BFdata <- select(BFdata,Gene,BF)
head(BFdata)

data("exacdf")
head(exacdf)

data("BPGOdf")
head(BPGOdf)

Single feature analysis

Let's look at the effect of lof_z from ExAC

lof_df <- filter(exacdf,feature=="lof_z") %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

result_lof <- FGEM_df(lof_df)

FGEM_df returns a one row dataframe with a summary of the fit

result_lof

To get the values of Beta for each feature:

select(result_lof,data) %>% unnest()

To get the prior and posterior from the model

result_lof <- gen_prior(lof_df,result_lof) %>% inner_join(gen_posterior(lof_df,result_lof))
arrange(result_lof,posterior) %>% head()

Multiple feature analysis

Let's look at lof_z,mis_z,pLI, and 5 random GO terms

exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI"))
exac_sub_df <- semi_join(exacdf,exac_features) %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

GO_features <- distinct(BPGOdf,feature) %>% sample_n(5)
feature_df <- semi_join(BPGOdf, GO_features) %>%
    spread(feature, value, fill = 0) %>%
    select(-class) %>%
    right_join(exac_sub_df) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x))

fit_df <- FGEM_df(feature_df, use_squarem = FALSE)
ofit_df <- FGEM_df(feature_df, use_squarem = TRUE)

result_df  <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df))
arrange(result_df,posterior) %>% head()

Competetive null hypothesis

Let's add gene size (bp) as a feature, but we'll also add it to the null hypothesis

exac_features <-tibble::tibble(feature=c("lof_z","mis_z","pLI","bp"))

exac_sub_df <- semi_join(exacdf,exac_features) %>% 
  spread(feature,value) %>% 
  select(-class) %>% 
  inner_join(BFdata)

GO_features <- distinct(BPGOdf,feature) %>% sample_n(5)
feature_df <- semi_join(BPGOdf,GO_features) %>% 
  spread(feature,value,fill=0) %>% 
  select(-class) %>% 
  right_join(exac_sub_df) %>% 
  mutate_all(function(x)ifelse(is.na(x),0,x))

fit_df <- FGEM_df(feature_df,null_features = c("bp","Intercept"))

result_df  <- gen_prior(feature_df,fit_df) %>% inner_join(gen_posterior(feature_df,fit_df))
arrange(result_df,posterior) %>% head()

CreRecombinase/FGEM documentation built on July 17, 2020, 5:30 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CreRecombinase/FGEM
R package for Functional Genomics using Expectation Maximization (FGEM)

In CreRecombinase/FGEM: R package for Functional Genomics using Expectation Maximization (FGEM)

Introduction

Model

Data

Model

Usage

`FGEM_df`

`gen_prior` and `gen_posterior`

Examples

Single feature analysis

Multiple feature analysis

Competetive null hypothesis

R Package Documentation

Browse R Packages

We want your feedback!

CreRecombinase/FGEM R package for Functional Genomics using Expectation Maximization (FGEM)

In CreRecombinase/FGEM: R package for Functional Genomics using Expectation Maximization (FGEM)

Introduction

Model

Data

Model

Usage

FGEM_df

gen_prior and gen_posterior

Examples

Single feature analysis

Multiple feature analysis

Competetive null hypothesis

R Package Documentation

Browse R Packages

We want your feedback!

CreRecombinase/FGEM
R package for Functional Genomics using Expectation Maximization (FGEM)

`FGEM_df`

`gen_prior` and `gen_posterior`