se_contrast_stats: Compute contrast statistics on SummarizedExperiment data
In jmw86069/jamses: Jam SummarizedExperiment Stats

se_contrast_stats

R Documentation

Compute contrast statistics on SummarizedExperiment data

Description

Compute contrast statistics on SummarizedExperiment data

Usage

se_contrast_stats(
  se,
  assay_names,
  adjp_cutoff = 0.05,
  p_cutoff = NULL,
  fold_cutoff = 1.5,
  int_adjp_cutoff = adjp_cutoff,
  int_p_cutoff = p_cutoff,
  int_fold_cutoff = fold_cutoff,
  mgm_cutoff = NULL,
  ave_cutoff = NULL,
  confint = FALSE,
  floor_min = NULL,
  floor_value = NULL,
  sedesign = NULL,
  icontrasts = NULL,
  idesign = NULL,
  igenes = NULL,
  isamples = NULL,
  enforce_design = TRUE,
  use_voom = FALSE,
  voom_block_twostep = TRUE,
  posthoc_test = c("none", "DEqMS"),
  posthoc_args = list(DEqMS = list(PSM_counts = NULL, fit.method = "loess")),
  weights = NULL,
  robust = FALSE,
  handle_na = c("full1", "full", "partial", "all", "none"),
  na_value = 0,
  rowData_colnames = c("SYMBOL"),
  collapse_by_gene = FALSE,
  block = NULL,
  correlation = NULL,
  max_correlation_rows = 10000,
  normgroup = NULL,
  do_normgroups = TRUE,
  seed = 123,
  verbose = FALSE,
  ...
)

Arguments

`se`	`SummarizedExperiment` object. Note that `colnames(se)` should match the samples in `sedesign`, or the `rownames(idesign)` when `sedesign` is not supplied. Data is subset by `colnames(se)` using `isamples` when supplied, and `rownames(se)` when `igenes` is supplied. Note argument `rowData_colnames` can be used to retain some `rowData(se)` columns in the stat `data.frame` summaries for convenience, particularly helpful when analyzing microarray data where the `rownames(se)` represent probe ID or assay ID.
`assay_names`	`character` vector with one or more assay names from `names(assays(se))`.
`adjp_cutoff`	`numeric` value threshold with the adjusted P-value at or below which a contrast result can be considered a statistical hit. This threshold is applied in addition to other `cutoff` values.
`p_cutoff`	`numeric` value threshold with the unadjusted P-value at or below which a contrast result can be considered a statistical hit. This argument is not recommended, in favor of using `adjp_cutoff`. This threshold is applied in addition to other `cutoff` values.
`fold_cutoff`	`numeric` value threshold indicating the normal absolute fold change at or above which a contrast result can be considered a statistical hit. For clarity, this threshold is normal space fold change, for example 2-fold would be `fold_cutoff=2`.
`int_adjp_cutoff`, `int_p_cutoff`, `int_fold_cutoff`	optional thresholds used only when a two-way interaction style contrast is detected. These optional thresholds may be useful to apply more lenient criteria to interaction contrasts, but in that event are cautioned to be used for data mining exercises. Ideally, the thresholds are identical between pairwise and interaction contrasts, and ideally there are enough replicates in the data to support the interaction contrasts with sufficient confidence to make those comparisons.
`mgm_cutoff`	`numeric` value threshold of the maximum group mean value required in each contrast for the contrast to be considered a statistical hit. The "max group mean" logic requires only one group in a contrast to be above this threshold, while all other groups can have values below the threshold. This threshold is applied in addition to other `cutoff` values.
`ave_cutoff`	`numeric` value threshold of the average expression as reported by `limma::lmFit()`, within each normgroup if relevant, for the contrast to be considered a statistical hit. This threshold is applied in addition to other `cutoff` values. Typically the column "AvgExpr" is calculated as a row mean.
`confint`	`logical` passed to `limma::topTable()` indicating whether to calculate 95% confidence intervals for log2 fold change `logFC` values. Alternatively it can be a `numeric` value between zero and one specifying a specific confidence interval.
`floor_min`, `floor_value`	`numeric` minimum value (floor_min) at or below which `numeric` values in the assay data matrix are reverted to `floor_value` as a replacement. This option is valuable to set all `numeric` values at or below zero to zero, or to set all values at or below zero to `NA` in circumstances where zero indicates "no measurement" and would be more accurately represented as a missing measurement than a measurement of `0`. Note: `floor_min` is applied before `handle_na`, so that this step is able to create `NA` values which will be handled appropriately based upon the option `handle_na`.
`sedesign`	`SEDesign` object as defined by `groups_to_sedesign()`, with slotNames `"design"`, `"contrasts"`, and `"samples"`. The arguments `idesign` and `icontrasts` are ignored when this argument is defined.
`icontrasts`, `idesign`	`numeric` matrices representing statistical contrasts, and sample-group design matrices. These values are ignored when `sedesign` is defined.
`igenes`	`character` vector with optional subset of `rownames(se)`, by default it uses all `rownames(se)`.
`isamples`	`character` vector with optional subset of `colnames(se)`, by default it uses `colnames(se)` that are also defined in the design matrix.
`enforce_design`	`logical` (this option is not implemented). By default the design matrix is used to subset the input `colnames(se)` as needed, and `isamples` is used to subset the design matrix and corresponding contrasts as relevant.
`use_voom`	`logical` indicating whether to apply `limma-voom` analysis steps. When applied, data is not scaled using `limma:voom()`, instead uses data as supplied.
`voom_block_twostep`	`logical` indicating whether to perform the "two-step" voom methodology when `block` is also supplied. This workflow is recommended by voom authors: call `voom()` first without `block` to determine overall `weights` call `duplicateCorrelation()` if necessary, using the voom `weights`, and the `block` argument, to calculate `correlation`. call `voom()` again using the `correlation`, `weights`, and `block` arguments. This produces improved `weights`. call `duplicateCorrelation()` again with the updated `weights`, and `block` in order to calculate improved `correlation`. Then proceed with `lmFit()` using appropriate `weights` using `block`; and appropriate `correlation` also using the proper `weights` and `block`.
`posthoc_test`	`character` string indicating an optional post-hoc test to apply. `"none"`: applies `limma::eBayes()` by default, the moderated t-test. `"DEqMS"`: applies adjustment for proteomics measurements provided by the package `"DEqMS"`. See `posthoc_args`.
`posthoc_args`	`list` named by the `posthoc_test` above. `"DEqMS"` recognizes two arguments, which are passed to `DEqMS::spectraCounteBayes()`: `"PSM_counts"`: a `numeric` vector of peptide spectra matched, one per `igenes` or `rownames(se)`. These values are used by DEqMS to model variability based upon the number of spectra as the key measure of confidence. `"fit.method"`: `character` name of the model to use, default is `"loess"`.
`weights`	`numeric` non-negative precision weights passed to `limma::lmFit()`, either as a matrix with nrow and ncol that match `igenes` and `isamples`, respectively, or matching one of `length(igenes)` or `length(isamples)`. When `igenes` or `isamples` are not supplied, it uses `nrow(se)` or `ncol(se)`, respectively.
`robust`	`logical` passed to `limma::eBayes()`, whether estimation of `df.prior` and `var.prior` should be robust against outlier sample variances.
`handle_na`	`character` string indicating how to handle `NA` values in the data matrix, passed to `handle_na_values()`. `"partial"`: Replace `NA` values with `na_value`, except when an entire group is `NA` the entire group is kept at `NA`. `"full"`: Retain `NA` values, except when an entire group is `NA` the replicate values are replaced with `na_value`. The option `"full1"` may be preferred, since it only replaces one value in the group, therefore does not misrepresent the variability of the group as zero. `"full1"`: Retain `NA` values, except when an entire group is `NA`, one replicate value in the group is replaced with `na_value`. By replacing only one replicate with `na_value` the group does not have a variance/dispersion, forcing the variance to be determined by the other group in the contrast, while still allowing an approximate fold change to be calculated. This option is suitable when there is a noise floor above zero, as it retains an approximate fold change while estimating a P-value using only the variance derived from the non-`NA` group. `"all"`: replace all `NA` values with `na_value`. `"none"`: perform no replacement of `NA` values.
`na_value`	`numeric` passed to `handle_na_values()`
`rowData_colnames`	`character` vector of colnames in `rowData(se)` that should be retained in each stat `data.frame` produced as output. The values in `rowData_colnames` are intersected with `colnames(rowData(se))` to determine which columns to keep.
`collapse_by_gene`	`logical` (not currently implemented).
`block`	`character`, `factor`, or `numeric` used as a blocking factor, using argument `block` in `limma::lmFit()` for example. The argument can be supplied as one of the following: `character` with one or more `colnames(colData(se))` `character` vector named using values in `isamples` and/or `colnames(se)` `character` vector without names, whose length must equal `length(isamples)` or when `isamples` is not provided, `ncol(se)`.
`correlation`	optional inter-duplicate or inter-technical correlation matrix passed to `limma::lmFit()`.
`max_correlation_rows`	`numeric` maximum number of rows in `imatrix` to use when calling `limma::duplicateCorrelation()`. This process only occurs when `block` is defined, `correlation=NULL` and `nrow(imatrix) > max_correlation_rows`. In this scenario, a random subset of rows are used to calculate `correlation`, then that `correlation` is used for `limma::lmFit()`. This process is intended to help very large data volumes, where the speed of `limma::duplicateCorrelation()` is impacted in quadratic manner by increasing number of rows, while also not improving the summary correlation value.
`normgroup`	`character` or `factor` vector with length `ncol(se)` or `length(isamples)` when `isamples` is defined. Values define independent normalization groups, which performs independent analyses within each unique normalization group. This option is intended for convenience, enabling separate variance models for each normalization group, which is appropriate when analyzing very different sample types. During limma model fit, all samples in all groups are used by default, which may incorrectly estimate variance when the variability by row is not uniform across different sample types. When `normgroup=NULL` the default is to assume all samples are in the same `normgroup="bulk"`. Each subset of samples begins with the same `sedesign`, `idesign`, `icontrast`, however they are fed into `validate_sestats()` to subset the appropriate contrasts to use only samples within the current normgroup. As a result, any contrasts that span two normgroups will be removed, and will not appear in the output. The argument can be supplied as one of the following: `character` with one or more `colnames(colData(se))` `character` vector named using values in `isamples` and/or `colnames(se)` `character` vector without names, whose length must equal `length(isamples)` or when `isamples` is not provided, `ncol(se)`.
`do_normgroups`	`logical` whether to enable normgroup processing, or to use the previous technique that kept all samples together. This argument may be removed in future, with recommendation to use `normgroup=NULL` to disable normalization group processing. Note that when `normgroup=NULL` the output should be identical with `do_normgroups=TRUE` and `do_normgroups=FALSE`.
`seed`	`numeric` used to set a random seed with `set.seed()` for reproducibility. Use `seed=NULL` to avoid setting a random seed. Note this action will only affect downstream functions that employ some form of randomization.
`verbose`	`logical` indicating whether to print verbose output.
`...`	additional arguments are ignored.

Details

This function is essentially a wrapper around statistical methods in the limma package, with additional steps to apply statistical thresholds to define "statistical hits" by three main criteria:

P-value or adjusted P-value
fold change
max group mean

This function is unique in that it applies the statistical methods to one or more "signals" in the input SummarizedExperiment assays, specifically intended to compare things like normalization methods.

If multiple statistical thresholds are defined, each one is applied in order, which is specifically designed to compare the effect of applying different statistical thresholds. For example one may want to pre-compute "statistical hits" using adjusted P-value 0.05, and 0.01; or using fold change >= 1.5, or fold change >= 2.0. The underlying statistics are the same, but a column indicating hits is created for each set of thresholds.

Hits are annotated:

-1 for down-regulation
0 for un-changed (by the given criteria)
1 for up-regulation

The results are therefore intended to feed directional Venn diagrams, which display the overlapping gene hits, and whether the directions are shared or opposed.

This function can optionally apply the limma-voom workflow, which involves calculating matrix weights using limma::voom(), then applying those weights during the model fit.

The output is intended to include several convenient formats:

stats_dfs - list of data.frame stats one per contrast
stats_df - one data.frame with all stats together
hit_array - array with three dimensions: signal; contrast; threshold whose cells contain hit flags (-1, 0, 1) named by rownames(se).

Design and contrast matrices can be defined using the function jamses::groups_to_sedesign(). That function assigns each sample to a sample group, then assembles all relevant group contrasts which involve only one-factor contrast at a time. It optionally defines two-factor contrasts (contrast of contrasts) where applicable.

A subset of genes (rownames(se)) or samples (colnames(se)) can be defined, to restrict calculations to use only the subset data.

Examples

set.seed(123)
expr <- rnorm(20) + 7;
noise <- rnorm(120) / 5;
fold <- rnorm(20) / 2.5;
m <- matrix(expr + noise, ncol=6);
for (i in 4:6) {
   m[,i] <- m[,i] + fold;
}
# m <- matrix(rnorm(120)/2 + 5, ncol=6);
colnames(m) <- paste0(rep(c("Vehicle", "Treated"), each=3), 1:3)
rownames(m) <- paste0("row", seq_len(20))
# simulate some "hits"
m[1, 4:6] <- m[1, 4:6] + 2
m[2, 4:6] <- m[2, 4:6] + 1.5
m[3, 4:6] <- m[3, 4:6] - 1.3
# create SummarizedExperiment
se <- SummarizedExperiment::SummarizedExperiment(
   assays=list(counts=m),
   colData=data.frame(sample=colnames(m),
   group=factor(rep(c("Vehicle", "Treated"), each=3),
      c("Vehicle", "Treated"))),
   rowData=data.frame(measurement=rownames(m)))

# assign colors
sample_color_list <- platjam::design2colors(se)

# heatmap
heatmap_se(se, sample_color_list=sample_color_list)

# create SEDesign
sedesign <- groups_to_sedesign(se, group_colnames="group")

# plot the design and contrasts
plot_sedesign(sedesign)

# limma contrasts
sestats <- se_contrast_stats(se=se,
   sedesign=sedesign,
   assay_names="counts")

# print data.frame summary of hits
sestats_to_df(sestats)

# plot the design with number of hits labeled
plot_sedesign(sedesign, sestats=sestats)

# heatmap with hits
heatmap_se(se,
   sample_color_list=sample_color_list,
   sestats=sestats)

# review stats table
stats_df <- sestats$stats_dfs$counts[["Treated-Vehicle"]]
head(stats_df)

# volcano plot for one contrast
jamma::volcano_plot(stats_df)

###############################
# simulate a two-way model
adjust <- rnorm(20);
new_fold <- rnorm(20);
se2 <- cbind(se, se)
groups2 <- paste0(rep(c("WT", "KO"), each=6),
   "_",
   SummarizedExperiment::colData(se)$group)
groups2 <- factor(groups2, levels=unique(groups2))
colnames(se2) <- paste0(groups2, 1:3)
SummarizedExperiment::colData(se2)$sample <- colnames(se2);
SummarizedExperiment::colData(se2)$group <- groups2;
SummarizedExperiment::colData(se2)$genotype <- jamba::gsubOrdered("_.+", "",
   SummarizedExperiment::colData(se2)$group)
SummarizedExperiment::colData(se2)$treatment <- jamba::gsubOrdered("^.+_", "",
   SummarizedExperiment::colData(se2)$group)
SummarizedExperiment::assays(se2)$counts[,7:12] <- (
   SummarizedExperiment::assays(se2)$counts[,7:12] + adjust)
SummarizedExperiment::assays(se2)$counts[,10:12] <- (
   SummarizedExperiment::assays(se2)$counts[,10:12] + new_fold)

# assign colors
sample_color_list2 <- platjam::design2colors(se2,
   class_colnames="genotype",
   color_sub=c(
      Vehicle="palegoldenrod",
      Treated="firebrick4"),
   group_colnames="group")

# heatmap
hm2a <- heatmap_se(se2,
   sample_color_list=sample_color_list2,
   controlSamples=1:3,
   center_label="versus WT_Vehicle",
   column_title_rot=60,
   row_cex=0.4,
   column_cex=0.5,
   column_title_gp=grid::gpar(fontsize=8),
   column_split="group")
ComplexHeatmap::draw(hm2a, column_title=attr(hm2a, "hm_title"), merge_legends=TRUE)

# heatmap centered by genotype
hm2b <- heatmap_se(se2,
   sample_color_list=sample_color_list2,
   controlSamples=c(1:3, 7:9),
   control_label="versus vehicle",
   column_gap=grid::unit(c(1, 3, 1), "mm"),
   centerby_colnames="genotype",
   column_title_rot=60,
   row_cex=0.4,
   column_cex=0.5,
   column_title_gp=grid::gpar(fontsize=8),
   column_split="group")
ComplexHeatmap::draw(hm2b, column_title=attr(hm2b, "hm_title"), merge_legends=TRUE)

# create SEDesign
sedesign2 <- groups_to_sedesign(se2, group_colnames="group")

# plot the design and contrasts
plot_sedesign(sedesign2, label_cex=0.7)
# note the two-way contrast can be flipped
plot_sedesign(sedesign2, flip_twoway=TRUE, label_cex=0.7)

# limma contrasts
sestats2 <- se_contrast_stats(se=se2,
   sedesign=sedesign2,
   assay_names="counts")

# print data.frame summary of hits
sestats_to_df(sestats2)

# plot the design with number of hits labeled
plot_sedesign(sedesign2, sestats=sestats2, label_cex=0.7)

# heatmap with hits
hm2hits <- heatmap_se(se2,
   sample_color_list=sample_color_list2,
   sestats=sestats2,
   controlSamples=c(1:3, 7:9),
   column_gap=grid::unit(c(1, 3, 1), "mm"),
   control_label="versus vehicle",
   centerby_colnames="genotype",
   row_split=3,
   row_cex=0.4,
   column_title_rot=60,
   column_cex=0.5,
   column_title_gp=grid::gpar(fontsize=8),
   column_split="group")
ComplexHeatmap::draw(hm2hits,
   column_title=attr(hm2hits, "hm_title"),
   merge_legends=TRUE)

# venn diagram
hit_list <- hit_array_to_list(sestats2,
   contrast_names=c(1:2))
jamba::sdim(hit_list);
# names(hit_list) <- gsub(":", ",<br>\n", contrast2comp(names(hit_list)))
# vo <- venndir::venndir(hit_list, expand_fraction=0.1)
# venndir::venndir_legender(venndir_out=vo, setlist=hit_list)

# vo <- venndir::venndir(hit_list, expand_fraction=0.1, proportional=TRUE)
# venndir::venndir_legender(venndir_out=vo, setlist=hit_list)

jmw86069/jamses documentation built on Nov. 4, 2024, 9:25 p.m.