run_mash | R Documentation |
Run mash analysis on dreamlet results
run_mash(fit, coefList)
fit |
result from |
coefList |
coefficient to be analyzed. Assumes 1) the null distribution of the two coefficients is simular, 2) the effects sizes are on the same scale, and 3) the effect estimates should be shrunk towards each other. If these are not satisfied, run separately on each coefficient |
Apply mashr analysis \insertCiteurbut2019flexibledreamlet on the joint set of coefficients for each gene and cell type. mashr
is a Bayesian statistical method that borrows strength across tests (i.e. genes and cell types) by learning the distribution of non-zero effects based the obesrved logFC and standard errors. The method then estimates the posterior distributions of each coefficient based on the observed value and the genome-wide emprical distribution.
mashr
has been previously applied to differential expression in GTEx data using multiple tissues from the same set of donors \insertCiteoliva2020impactdreamlet.
In single cell data, a given gene is often not sufficiently expressed in all cell types. So it is not evaluated in a subsets of cell types, and its coefficient value is NA
. Since mashr assumes coefficients and standard errors for every gene and cell type pair, entries with these missing values are set to have coef = 0
, and se = 1e6
. The output of mashr is then modified to set the corresponding values to NA
, to avoid nonsensical results downstream.
a list storing the mashr
model as model
and the original coefficients as logFC.original
mashr::mash_estimate_corr_em()
, mashr::cov_canonical
, mashr::mash_set_data
library(muscat)
library(mashr)
library(SingleCellExperiment)
data(example_sce)
# create pseudobulk for each sample and cell cluster
pb <- aggregateToPseudoBulk(example_sce[1:100, ],
assay = "counts",
cluster_id = "cluster_id",
sample_id = "sample_id",
verbose = FALSE
)
# voom-style normalization
res.proc <- processAssays(pb, ~group_id)
# Differential expression analysis within each assay,
# evaluated on the voom normalized data
res.dl <- dreamlet(res.proc, ~group_id)
# run MASH model
# This can take 10s of minutes on real data
# This small datasets should take ~30s
res_mash <- run_mash(res.dl, "group_idstim")
# extract statistics from mashr model
# NA values indicate genes not sufficiently expressed
# in a given cell type
# original logFC
head(res_mash$logFC.original)
# posterior mean for logFC
head(get_pm(res_mash$model))
# how many gene-by-celltype tests are significant
# i.e. if a gene is significant in 2 celltypes, it is counted twice
table(get_lfsr(res_mash$model) < 0.05, useNA = "ifany")
# how many genes are significant in at least one cell type
table(apply(get_lfsr(res_mash$model), 1, min, na.rm = TRUE) < 0.05)
# how many genes are significant in each cell type
apply(get_lfsr(res_mash$model), 2, function(x) sum(x < 0.05, na.rm = TRUE))
# examine top set of genes
# which genes are significant in at least 1 cell type
sort(names(get_significant_results(res_mash$model)))[1:10]
# Lets examine ENO1
# There is a lot of variation in the raw logFC
res_mash$logFC.original["ENO1", ]
# posterior mean after borrowing across cell type and genes
get_pm(res_mash$model)["ENO1", ]
# forest plot based on mashr results
plotForest(res_mash, "ENO1")
# volcano plot based on mashr results
# yaxis uses local false sign rate (lfsr)
plotVolcano(res_mash)
# Comment out to reduce package runtime
# gene set analysis using mashr results
# library(zenith)
# go.gs = get_GeneOntology("CC", to="SYMBOL")
# df_gs = zenith_gsa(res_mash, go.gs)
# Heatmap of results
# plotZenithResults(df_gs, 2, 1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.